Understanding the issue:

  1. What is the main motivation of this project? The goal is to predict the number of applications for admission to American colleges, based on 777 observations of 17 predictor variables. The aim is to fit a model to the data that predicts the variable named “Apps” as well as possible.

  2. What can the output of such a project be used for? Predicting the number of applications for admission to universities can support future educational planning and a better understanding of the variables that affect applicant numbers, for example:

    • Which variables have the greatest impact on the number of applications, and whether they can be strengthened in future educational planning.

    • Which variables have the least impact on the number of applications for admission to universities.

    • Which variables have no effect (i.e., are independent of “Apps”) and could potentially be omitted.

  3. Who might be interested in the results of this project?

    • Educational regulatory organizations, such as a Ministry of Science,
    • Universities and educational centers,
    • Research organizations in the field of education.

Understanding data:

  1. Where did the data come from and how was it collected? This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the ASA Statistical Graphics Section’s 1995 Data Analysis Exposition.

  2. What do each of the variables measure?

    • Private: A factor with levels No and Yes indicating private or public university
    • Apps: Number of applications received
    • Accept: Number of applications accepted
    • Enroll: Number of new students enrolled
    • Top10perc: Pct. new students from top 10% of H.S. class
    • Top25perc: Pct. new students from top 25% of H.S. class
    • F.Undergrad: Number of fulltime undergraduates
    • P.Undergrad: Number of parttime undergraduates
    • Outstate: Out-of-state tuition
    • Room.Board: Room and board costs
    • Books: Estimated book costs
    • Personal: Estimated personal spending
    • PhD: Pct. of faculty with Ph.D.’s
    • Terminal: Pct. of faculty with terminal degree
    • S.F.Ratio: Student/faculty ratio
    • perc.alumni: Pct. alumni who donate
    • Expend: Instructional expenditure per student
    • Grad.Rate: Graduation rate
  3. Is there any ambiguity in the definitions of the data? Yes, there is some ambiguity in how some variables are defined or calculated, such as perc.alumni, Grad.Rate, and S.F.Ratio (the student-to-faculty ratio).

  4. Is there an error in measuring variables or recording data? Yes, errors are possible for various reasons, such as:

    • Inconsistent definitions, or different interpretations of the same definition, by the people responsible for collecting data at different universities.
    • Data-entry errors by the operator.
  5. What other variables, if any, could help solve the problem? Independent variables such as:

    • Number of courses offered by the university
    • Number of outstanding professors in each field, based on citations in reputable scientific journals
    • International university ranking
  6. What are the existing variables (categorical/numerical)? Except for one variable named “Private”, which is binary (a discrete categorical variable), all the variables are numerical.


Data preparation & Modeling

Required Libraries

require(moments)           #Moments, skewness, kurtosis and related tests
require(corrplot)          #Visualization of Correlation Matrix
require(car)               #Required for Calculations Related to Regression
require(ggplot2)           #Create Elegant Data Visualizations
require(MASS)              #Box-Cox Transformations for Linear Models
require(leaps)             #Regression Subset Selection
require(glmnet)            #Lasso and Elastic-Net Regularized GLM
require(rpart)             #Classification and Regression Trees 
require(rpart.plot)        #Plot Decision Tree
require(randomForest)      #Random Forests for Classification and Regression 
require(gbm)               #Boosting Methods in Regression Problems
require(xgboost)           #Boosting Methods in Regression Problems

Read Data From File

Statistical Summary

data<-read.csv("/Volumes/USB DISK/project/college.csv",header = TRUE)
class(data)
## [1] "data.frame"
dim(data)
## [1] 777  19
head(data,2)
##                   College.Name Private Apps Accept Enroll Top10perc Top25perc
## 1 Abilene Christian University     Yes 1660   1232    721        23        52
## 2           Adelphi University     Yes 2186   1924    512        16        29
##   F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal
## 1        2885         537     7440       3300   450     2200  70       78
## 2        2683        1227    12280       6450   750     1500  29       30
##   S.F.Ratio perc.alumni Expend Grad.Rate
## 1      18.1          12   7041        60
## 2      12.2          16  10527        56
tail(data,2)
##                     College.Name Private  Apps Accept Enroll Top10perc
## 776              Yale University     Yes 10705   2453   1317        95
## 777 York College of Pennsylvania     Yes  2989   1855    691        28
##     Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD
## 776        99        5217          83    19840       6510   630     2115  96
## 777        63        2988        1726     4990       3560   500     1250  75
##     Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 776       96       5.8          49  40386        99
## 777       75      18.1          28   4509        99
  1. Univariate View:

College.Name

length(unique(data$College.Name)) 
## [1] 777
# No duplicate data 

Private

class(data$Private) 
## [1] "character"
data$Private<-factor(data$Private,levels = c("Yes","No")) #Convert to factor
levels(data$Private)<-c("Private","Not Private") #Rename factor levels
levels(data$Private)
## [1] "Private"     "Not Private"
summary(data$Private) 
##     Private Not Private 
##         565         212
##Distribution of Private
table(data$Private, useNA = "ifany") 
## 
##     Private Not Private 
##         565         212
# No missing values; private universities outnumber public ones by more than two to one.
barplot(table(data$Private),main = "Distribution of university type", col = "darkblue")

Apps (Dependent variable)

summary(data$Apps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      81     776    1558    3002    3624   48094
which(data$Apps==48094)
## [1] 484
#The maximum of Apps (48094) occurs only once, so it may have been entered incorrectly.
#checking missing value of Apps:
sum(is.na(data$Apps))
## [1] 0
# No missing value
#Distribution of Apps
#1.histogram of Apps
hist(data$Apps,xlab="Number of applications received",main="Histogram of Apps")

#2.boxplot of Apps
boxplot(data$Apps,main="Box Plot of Apps")

boxplot(Apps~Private,data=data,main="Box Plot of Apps by type of university",xlab="Type of university")

#3.QQ-plot of Apps
qqnorm(data$Apps, main="qqplot of Apps",pch=20)
qqline(data$Apps,col="red")

#4.Jarque-Bera Test (skewness=0) and Anscombe-Glynn Test (kurtosis=3) for Apps
library(moments)
jarque.test(data$Apps)
## 
##  Jarque-Bera Normality Test
## 
## data:  data$Apps
## JB = 24687, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(data$Apps)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  data$Apps
## kurt = 29.595, z = 15.195, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Conclusion: reject the normality assumption for the Apps distribution.

Other Continuous Independent Variables

#checking for missing values (note: index the column, not the logical matrix):
for (i in c(4:19)) print(c(names(data)[i],NumberofMissingValue=sum(is.na(data[,i]))))
##                      NumberofMissingValue 
##             "Accept"                  "0" 
##                      NumberofMissingValue 
##             "Enroll"                  "0" 
##                      NumberofMissingValue 
##          "Top10perc"                  "0" 
##                      NumberofMissingValue 
##          "Top25perc"                  "0" 
##                      NumberofMissingValue 
##        "F.Undergrad"                  "0" 
##                      NumberofMissingValue 
##        "P.Undergrad"                  "0" 
##                      NumberofMissingValue 
##           "Outstate"                  "0" 
##                      NumberofMissingValue 
##         "Room.Board"                  "0" 
##                      NumberofMissingValue 
##              "Books"                  "0" 
##                      NumberofMissingValue 
##           "Personal"                  "0" 
##                      NumberofMissingValue 
##                "PhD"                  "0" 
##                      NumberofMissingValue 
##           "Terminal"                  "0" 
##                      NumberofMissingValue 
##          "S.F.Ratio"                  "0" 
##                      NumberofMissingValue 
##        "perc.alumni"                  "0" 
##                      NumberofMissingValue 
##             "Expend"                  "0" 
##                      NumberofMissingValue 
##          "Grad.Rate"                  "0"
# No missing values
#Distribution of the other continuous independent variables
#1.Histogram of each continuous independent variable
par(mar=c(2,2,2,2))
par(mfrow=c(4,4)) #4 rows and 4 columns
for(i in c(4:19)){
  hist(data[,i],main=paste("Histogram of", names(data)[i]),col="darkblue")}

#2.Boxplot of each continuous independent variable
for(i in c(4:19)){
  boxplot(data[,i],main=paste("Box Plot of",names(data)[i]))
}

#3.QQ-plot of each continuous independent variable
for(i in c(4:19)){
  qqnorm(data[,i],main=paste("qqplot of",names(data)[i]),pch=20)
  qqline(data[,i],col="red")}

par(mfrow=c(1,1))
#4.Jarque-Bera Test (skewness=0) for the other continuous independent variables
library(moments)
sapply(data[,4:19],jarque.test)
##             Accept                       Enroll                      
## statistic   12960.1                      3422.191                    
## p.value     0                            0                           
## alternative "greater"                    "greater"                   
## method      "Jarque-Bera Normality Test" "Jarque-Bera Normality Test"
## data.name   "X[[i]]"                     "X[[i]]"                    
##             Top10perc                    Top25perc                   
## statistic   412.3679                     19.12887                    
## p.value     0                            7.018076e-05                
## alternative "greater"                    "greater"                   
## method      "Jarque-Bera Normality Test" "Jarque-Bera Normality Test"
## data.name   "X[[i]]"                     "X[[i]]"                    
##             F.Undergrad                  P.Undergrad                 
## statistic   2768.508                     100954.3                    
## p.value     0                            0                           
## alternative "greater"                    "greater"                   
## method      "Jarque-Bera Normality Test" "Jarque-Bera Normality Test"
## data.name   "X[[i]]"                     "X[[i]]"                    
##             Outstate                     Room.Board                  
## statistic   39.13872                     30.61429                    
## p.value     3.170551e-09                 2.25005e-07                 
## alternative "greater"                    "greater"                   
## method      "Jarque-Bera Normality Test" "Jarque-Bera Normality Test"
## data.name   "X[[i]]"                     "X[[i]]"                    
##             Books                        Personal                    
## statistic   27209.39                     2010.193                    
## p.value     0                            0                           
## alternative "greater"                    "greater"                   
## method      "Jarque-Bera Normality Test" "Jarque-Bera Normality Test"
## data.name   "X[[i]]"                     "X[[i]]"                    
##             PhD                          Terminal                    
## statistic   86.03721                     87.76365                    
## p.value     0                            0                           
## alternative "greater"                    "greater"                   
## method      "Jarque-Bera Normality Test" "Jarque-Bera Normality Test"
## data.name   "X[[i]]"                     "X[[i]]"                    
##             S.F.Ratio                    perc.alumni                 
## statistic   265.8505                     47.86244                    
## p.value     0                            4.043932e-11                
## alternative "greater"                    "greater"                   
## method      "Jarque-Bera Normality Test" "Jarque-Bera Normality Test"
## data.name   "X[[i]]"                     "X[[i]]"                    
##             Expend                       Grad.Rate                   
## statistic   12796.29                     3.119798                    
## p.value     0                            0.2101573                   
## alternative "greater"                    "greater"                   
## method      "Jarque-Bera Normality Test" "Jarque-Bera Normality Test"
## data.name   "X[[i]]"                     "X[[i]]"
#Skewness test: reject the normality assumption for all variables except "Grad.Rate".
#Anscombe-Glynn Test (kurtosis=3) for the other continuous independent variables
sapply(data[,4:19],anscombe.test)
##             Accept                         Enroll                        
## statistic   numeric,2                      numeric,2                     
## p.value     9.93411e-46                    1.425782e-31                  
## alternative "kurtosis is not equal to 3"   "kurtosis is not equal to 3"  
## method      "Anscombe-Glynn kurtosis test" "Anscombe-Glynn kurtosis test"
## data.name   "X[[i]]"                       "X[[i]]"                      
##             Top10perc                      Top25perc                     
## statistic   numeric,2                      numeric,2                     
## p.value     4.357408e-11                   6.971441e-06                  
## alternative "kurtosis is not equal to 3"   "kurtosis is not equal to 3"  
## method      "Anscombe-Glynn kurtosis test" "Anscombe-Glynn kurtosis test"
## data.name   "X[[i]]"                       "X[[i]]"                      
##             F.Undergrad                    P.Undergrad                   
## statistic   numeric,2                      numeric,2                     
## p.value     4.030631e-29                   8.73106e-65                   
## alternative "kurtosis is not equal to 3"   "kurtosis is not equal to 3"  
## method      "Anscombe-Glynn kurtosis test" "Anscombe-Glynn kurtosis test"
## data.name   "X[[i]]"                       "X[[i]]"                      
##             Outstate                       Room.Board                    
## statistic   numeric,2                      numeric,2                     
## p.value     0.003368399                    0.2678832                     
## alternative "kurtosis is not equal to 3"   "kurtosis is not equal to 3"  
## method      "Anscombe-Glynn kurtosis test" "Anscombe-Glynn kurtosis test"
## data.name   "X[[i]]"                       "X[[i]]"                      
##             Books                          Personal                      
## statistic   numeric,2                      numeric,2                     
## p.value     3.53884e-53                    9.053454e-28                  
## alternative "kurtosis is not equal to 3"   "kurtosis is not equal to 3"  
## method      "Anscombe-Glynn kurtosis test" "Anscombe-Glynn kurtosis test"
## data.name   "X[[i]]"                       "X[[i]]"                      
##             PhD                            Terminal                      
## statistic   numeric,2                      numeric,2                     
## p.value     0.007727027                    0.1814422                     
## alternative "kurtosis is not equal to 3"   "kurtosis is not equal to 3"  
## method      "Anscombe-Glynn kurtosis test" "Anscombe-Glynn kurtosis test"
## data.name   "X[[i]]"                       "X[[i]]"                      
##             S.F.Ratio                      perc.alumni                   
## statistic   numeric,2                      numeric,2                     
## p.value     1.038581e-12                   0.6156338                     
## alternative "kurtosis is not equal to 3"   "kurtosis is not equal to 3"  
## method      "Anscombe-Glynn kurtosis test" "Anscombe-Glynn kurtosis test"
## data.name   "X[[i]]"                       "X[[i]]"                      
##             Expend                         Grad.Rate                     
## statistic   numeric,2                      numeric,2                     
## p.value     1.453228e-45                   0.2175176                     
## alternative "kurtosis is not equal to 3"   "kurtosis is not equal to 3"  
## method      "Anscombe-Glynn kurtosis test" "Anscombe-Glynn kurtosis test"
## data.name   "X[[i]]"                       "X[[i]]"
#Kurtosis test: reject the normality assumption for all variables except "Room.Board", "Terminal", "Grad.Rate", and "perc.alumni".
#Conclusion: combining both tests, reject the normality assumption for all variables except "Grad.Rate".
  2. Correlation Analysis
library(corrplot)
cor_table<- round(cor(data[,c(3,4:19)]),3)
cor_table
##               Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## Apps         1.000  0.943  0.847     0.339     0.352       0.814       0.398
## Accept       0.943  1.000  0.912     0.192     0.247       0.874       0.441
## Enroll       0.847  0.912  1.000     0.181     0.227       0.965       0.513
## Top10perc    0.339  0.192  0.181     1.000     0.892       0.141      -0.105
## Top25perc    0.352  0.247  0.227     0.892     1.000       0.199      -0.054
## F.Undergrad  0.814  0.874  0.965     0.141     0.199       1.000       0.571
## P.Undergrad  0.398  0.441  0.513    -0.105    -0.054       0.571       1.000
## Outstate     0.050 -0.026 -0.155     0.562     0.489      -0.216      -0.254
## Room.Board   0.165  0.091 -0.040     0.371     0.331      -0.069      -0.061
## Books        0.133  0.114  0.113     0.119     0.116       0.116       0.081
## Personal     0.179  0.201  0.281    -0.093    -0.081       0.317       0.320
## PhD          0.391  0.356  0.331     0.532     0.546       0.318       0.149
## Terminal     0.369  0.338  0.308     0.491     0.525       0.300       0.142
## S.F.Ratio    0.096  0.176  0.237    -0.385    -0.295       0.280       0.233
## perc.alumni -0.090 -0.160 -0.181     0.455     0.418      -0.229      -0.281
## Expend       0.260  0.125  0.064     0.661     0.527       0.019      -0.084
## Grad.Rate    0.147  0.067 -0.022     0.495     0.477      -0.079      -0.257
##             Outstate Room.Board  Books Personal    PhD Terminal S.F.Ratio
## Apps           0.050      0.165  0.133    0.179  0.391    0.369     0.096
## Accept        -0.026      0.091  0.114    0.201  0.356    0.338     0.176
## Enroll        -0.155     -0.040  0.113    0.281  0.331    0.308     0.237
## Top10perc      0.562      0.371  0.119   -0.093  0.532    0.491    -0.385
## Top25perc      0.489      0.331  0.116   -0.081  0.546    0.525    -0.295
## F.Undergrad   -0.216     -0.069  0.116    0.317  0.318    0.300     0.280
## P.Undergrad   -0.254     -0.061  0.081    0.320  0.149    0.142     0.233
## Outstate       1.000      0.654  0.039   -0.299  0.383    0.408    -0.555
## Room.Board     0.654      1.000  0.128   -0.199  0.329    0.375    -0.363
## Books          0.039      0.128  1.000    0.179  0.027    0.100    -0.032
## Personal      -0.299     -0.199  0.179    1.000 -0.011   -0.031     0.136
## PhD            0.383      0.329  0.027   -0.011  1.000    0.850    -0.131
## Terminal       0.408      0.375  0.100   -0.031  0.850    1.000    -0.160
## S.F.Ratio     -0.555     -0.363 -0.032    0.136 -0.131   -0.160     1.000
## perc.alumni    0.566      0.272 -0.040   -0.286  0.249    0.267    -0.403
## Expend         0.673      0.502  0.112   -0.098  0.433    0.439    -0.584
## Grad.Rate      0.571      0.425  0.001   -0.269  0.305    0.290    -0.307
##             perc.alumni Expend Grad.Rate
## Apps             -0.090  0.260     0.147
## Accept           -0.160  0.125     0.067
## Enroll           -0.181  0.064    -0.022
## Top10perc         0.455  0.661     0.495
## Top25perc         0.418  0.527     0.477
## F.Undergrad      -0.229  0.019    -0.079
## P.Undergrad      -0.281 -0.084    -0.257
## Outstate          0.566  0.673     0.571
## Room.Board        0.272  0.502     0.425
## Books            -0.040  0.112     0.001
## Personal         -0.286 -0.098    -0.269
## PhD               0.249  0.433     0.305
## Terminal          0.267  0.439     0.290
## S.F.Ratio        -0.403 -0.584    -0.307
## perc.alumni       1.000  0.418     0.491
## Expend            0.418  1.000     0.390
## Grad.Rate         0.491  0.390     1.000
corrplot(cor_table)

As can be seen, Apps is highly correlated with Accept, Enroll, and F.Undergrad, and these three predictors are also strongly correlated with one another (e.g., cor(Enroll, F.Undergrad) = 0.965).
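This multicollinearity can be quantified with variance inflation factors. The underlying computation is simple: regress each predictor on the others and take VIF = 1/(1 − R²). A base-R sketch on simulated data (not the college data; x1 and x2 mimic a nearly collinear pair such as Enroll and F.Undergrad, and `vif_manual` is a hand-rolled helper):

```r
# Manual variance inflation factors: regress each predictor on the
# others and take VIF = 1 / (1 - R^2).  Values above ~10 signal
# strong multicollinearity.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1
x3 <- rnorm(n)                  # unrelated predictor
X  <- data.frame(x1, x2, x3)

vif_manual <- function(df) {
  sapply(names(df), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(df), v), response = v),
                     data = df))$r.squared
    1 / (1 - r2)
  })
}
round(vif_manual(X), 1)   # x1 and x2: very large VIFs; x3: close to 1
```

With the car package already loaded above, `vif(lm(Apps ~ ., data = train))` would give the same quantities for the college data directly.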

  3. Scatter Plots
par(mar=c(2,2,2,2))
par(mfrow=c(4,4))#4 rows and 4 columns
for(i in c(4:19)){
  plot(data[,i],data$Apps,main=paste("Scatter Plot of Apps vs.",names(data)[i]))
}

  4. Categorical vs. Numerical Variables
par(mar=c(2,2,2,2))
par(mfrow=c(4,4))#4 rows and 4 columns
for(i in c(4:19)){
  boxplot(data[,i]~Private,data=data,main=paste("Box Plot of",names(data)[i],"by type of university"))
}

par(mfrow=c(1,1))

Identify Outliers

Univariate Detection

1. Apps

data_1<-data[,-1]
tukey_1<-quantile(data_1$Apps,probs=0.75)+1.5*IQR(data_1$Apps)
tukey_2<-quantile(data_1$Apps,probs=0.25)-1.5*IQR(data_1$Apps)
out_Apps_sum<-sum(data_1$Apps>tukey_1|data_1$Apps < tukey_2)
out_Apps_sum
## [1] 70
out_Apps_procent<-out_Apps_sum/nrow(data_1)*100
out_Apps_procent  #About 9% of the Apps values are outliers
## [1] 9.009009
which((data_1$Apps > tukey_1|data_1$Apps < tukey_2))
##  [1]  24  60  62  71  88 119 142 159 175 177 192 204 222 251 270 275 278 280 285
## [20] 366 367 408 413 419 421 425 433 441 446 460 462 484 511 561 562 563 564 566
## [39] 570 571 577 582 606 607 612 615 620 621 624 625 627 634 635 638 641 650 652
## [58] 663 664 665 668 670 678 686 694 695 701 714 744 776
out_Apps<-data_1$Apps[data_1$Apps > tukey_1|data_1$Apps < tukey_2]
out_Apps
##  [1] 12809 20192  9251 12586  8728  8065  9478  8587 13789  9274  8506 11651
## [13] 11115 13865  8681 16587  8427 11223  8474  9239 18114 13594 10634 11901
## [25] 10706 12289 11023  8256 19315 13218 21804 48094  9402 13528 14463 15039
## [37] 12512  8000  8598  8399 10477 14474 19873 15698  9735 14446 12445 11220
## [49] 14939  8384  8579 14292 14438 19152 11054  9750 14596  8631 12394  8586
## [61]  9643  8766 12229 14752 15849 12749 14901 15712  9167 10705

2. Other variables

Show which observations are outliers in each variable:
for(i in c(3:18) ){
  print(names(data_1)[i])
  tukey_1i<-quantile(data_1[,i],probs=0.75)+1.5*IQR(data_1[,i])
  tukey_2i<-quantile(data_1[,i],probs=0.25)-1.5*IQR(data_1[,i])
  out_Apps_sumi<-sum(data_1[,i]>tukey_1i|data_1[,i] < tukey_2i)
  print(out_Apps_sumi)
  out_Apps_procenti<-out_Apps_sumi/nrow(data_1)*100
  print(out_Apps_procenti)
  print(which(data_1[,i]>tukey_1i|data_1[,i] < tukey_2i))
}
## [1] "Accept"
## [1] 73
## [1] 9.395109
##  [1]  24  28  40  60  62  70  88 119 142 177 204 258 270 275 278 279 280 366 367
## [20] 408 413 419 421 425 433 446 462 484 511 537 561 562 563 564 577 582 606 607
## [39] 611 612 614 615 620 621 624 625 627 634 635 637 638 641 648 650 652 663 664
## [58] 665 668 670 676 678 684 686 693 694 695 701 711 714 728 729 744
## [1] "Enroll"
## [1] 79
## [1] 10.16731
##  [1]  22  24  28  40  60  62  70 119 142 177 204 223 270 274 275 278 280 289 325
## [20] 346 366 367 408 413 419 420 421 425 433 437 446 462 484 490 511 537 563 577
## [39] 582 586 605 606 607 608 611 612 615 620 621 624 625 627 634 635 638 641 643
## [58] 648 650 652 658 661 662 663 664 665 668 676 678 684 686 692 694 695 701 702
## [77] 714 728 744
## [1] "Top10perc"
## [1] 39
## [1] 5.019305
##  [1]  17  55  61  71  72  87  92 115 138 139 145 159 160 175 192 222 223 251 252
## [20] 285 355 408 425 447 460 606 607 610 638 652 661 664 694 709 721 726 734 764
## [39] 776
## [1] "Top25perc"
## [1] 0
## [1] 0
## integer(0)
## [1] "F.Undergrad"
## [1] 97
## [1] 12.48391
##  [1]  22  24  28  40  60  62  70  79  80 119 142 177 182 202 204 219 223 270 274
## [20] 275 278 280 281 289 325 341 366 367 376 383 408 413 419 420 421 433 437 446
## [39] 462 484 490 511 537 561 562 563 564 577 582 605 606 607 608 611 612 615 620
## [58] 621 623 624 625 627 629 634 635 638 641 643 648 650 652 653 658 660 662 663
## [77] 664 665 668 676 677 678 681 684 685 686 687 692 694 695 701 702 712 714 728
## [96] 744 747
## [1] "P.Undergrad"
## [1] 67
## [1] 8.622909
##  [1]  24  39  60 103 143 171 178 202 204 219 224 233 234 275 304 325 330 346 367
## [20] 384 408 413 419 420 428 441 462 483 484 511 534 537 563 582 586 604 608 615
## [39] 620 621 623 625 629 633 634 641 645 648 653 657 658 662 665 668 676 677 680
## [58] 684 685 686 687 692 695 696 702 712 744
## [1] "Outstate"
## [1] 1
## [1] 0.1287001
## [1] 48
## [1] "Room.Board"
## [1] 7
## [1] 0.9009009
## [1]  38 347 408 415 419 516 664
## [1] "Books"
## [1] 46
## [1] 5.920206
##  [1]   5  19  22  35  61  64  70  73 101 119 134 182 202 258 302 318 320 321 368
## [20] 369 374 378 388 390 397 422 437 455 465 472 481 483 498 514 531 534 574 579
## [39] 649 688 691 692 728 736 738 742
## [1] "Personal"
## [1] 20
## [1] 2.574003
##  [1] 190 202 224 296 299 318 369 431 437 474 498 581 625 645 662 684 686 692 712
## [20] 763
## [1] "PhD"
## [1] 8
## [1] 1.029601
## [1]  86  96 101 227 514 609 688 736
## [1] "Terminal"
## [1] 8
## [1] 1.029601
## [1]   2 227 265 342 369 395 507 740
## [1] "S.F.Ratio"
## [1] 12
## [1] 1.544402
##  [1]  92 227 276 285 312 318 364 495 609 687 729 744
## [1] "perc.alumni"
## [1] 5
## [1] 0.6435006
## [1]  17  87 107 243 764
## [1] "Expend"
## [1] 48
## [1] 6.177606
##  [1]   4  17  21  61  65  71  72  87  88  92 115 145 153 159 160 175 192 222 238
## [20] 243 251 252 285 355 391 408 425 460 498 516 529 576 594 610 637 664 670 678
## [39] 690 709 710 721 729 734 736 754 764 776
## [1] "Grad.Rate"
## [1] 4
## [1] 0.5148005
## [1]   5  96 385 586
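The fence computation above is repeated for Apps and again inside the loop; it could be wrapped in a small reusable helper (a base-R sketch; the function name `tukey_outliers` is my own):

```r
# Tukey fences: a value is flagged if it lies above Q3 + 1.5*IQR or
# below Q1 - 1.5*IQR.  Returns the indices of the flagged values.
tukey_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * IQR(x, na.rm = TRUE)
  which(x > q[2] + fence | x < q[1] - fence)
}
tukey_outliers(c(1, 2, 3, 4, 100))   # flags the 5th value
```

Applied as `tukey_outliers(data_1$Apps)`, this should reproduce the indices listed above.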
max(out_Apps) 
## [1] 48094
data$College.Name[484] 
## [1] "Rutgers at New Brunswick"
# We expect this observation to be a high-leverage point.

Multivariate Detection

data_2<-data[,3:19]
mah_d2<-mahalanobis(data_2,colMeans(data[,3:19]),cov(data_2))
out_All<-mah_d2[mah_d2>10]
length(out_All)/length(mah_d2)*100
## [1] 56.49936

56.5% of the observations have a squared Mahalanobis distance > 10.

out_All<-mah_d2[mah_d2>20]
length(out_All)/length(mah_d2)*100 
## [1] 19.94852

About 19.9% of the observations have a squared Mahalanobis distance > 20.

out_All<-mah_d2[mah_d2>30]
length(out_All)/length(mah_d2)*100
## [1] 11.19691

About 11.2% of the observations have a squared Mahalanobis distance > 30.
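The cutoffs of 10, 20, and 30 used above are ad hoc. Under multivariate normality, the squared Mahalanobis distance returned by mahalanobis() is approximately chi-squared distributed with df equal to the number of variables (17 here), which suggests a more principled cutoff (a base-R sketch):

```r
# Under multivariate normality, squared Mahalanobis distances follow
# (approximately) a chi-squared distribution with df = number of
# variables (17 here), so a conventional outlier cutoff is a high
# quantile of that distribution rather than a round number.
cutoff <- qchisq(0.975, df = 17)
round(cutoff, 2)   # about 30.19, close to the ">30" threshold used above
```

With the distances computed above, `sum(mah_d2 > cutoff)` would then count the multivariate outliers under this criterion.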

Remove Observation #484

Observation #484, besides Apps, is also an outlier in the variables:

  1. P.Undergrad,

  2. F.Undergrad,

  3. Accept,

  4. Enroll.

Observation #484 was probably entered incorrectly, so to improve the results we remove it from the data.

data<-data[-which(row.names(data)==484),]

#Apps (dependent variable without observation #484)
summary(data$Apps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      81     776    1558    2944    3603   21804
#Distribution of Apps without observation #484

#1.histogram
hist(data$Apps,xlab="Number of applications received",main="Histogram of Apps")

#2.boxplot
boxplot(data$Apps,main="Box Plot of Apps")

#3.QQ-plot
qqnorm(data$Apps, main="qqplot of applications received",pch=20)
qqline(data$Apps,col="red")

#4.Jarque-Bera Test (skewness=0) and Anscombe-Glynn Test (kurtosis=3)
library(moments)
jarque.test(data$Apps)
## 
##  Jarque-Bera Normality Test
## 
## data:  data$Apps
## JB = 1722.4, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(data$Apps)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  data$Apps
## kurt = 8.6878, z = 10.1187, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Conclusion: despite the improved statistics, we still reject the normality assumption for the Apps distribution.
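Since the right skew is what drives the rejection, a log transformation of Apps is a natural candidate before fitting linear models. A base-R sketch on simulated lognormal data (illustrative only; `skew` is a hand-rolled moment estimator, and whether the transform actually helps here should be checked on Apps itself):

```r
# Log-transforming a right-skewed variable pulls in the long right
# tail, so the sample skewness drops toward 0.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3   # simple moment estimator
set.seed(42)
x <- rlnorm(1000, meanlog = 7, sdlog = 1)   # strongly right-skewed, like Apps
c(raw = skew(x), logged = skew(log(x)))     # logged skewness is near 0
```

For the college data, a data-driven choice of transformation could also come from the Box-Cox machinery in the MASS package loaded above (e.g., MASS::boxcox on a fitted model).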

Remove College Name

data_1<-data[,-1]

Train & Test

Divide Dataset into Train and Test

set.seed(123456)
train_cases <- sample(1:nrow(data_1), nrow(data_1) * 0.7)
train <- data_1[train_cases,]
test  <- data_1[- train_cases,]
summary(train)
##         Private         Apps             Accept            Enroll      
##  Private    :390   Min.   :   81.0   Min.   :   72.0   Min.   :  35.0  
##  Not Private:153   1st Qu.:  738.5   1st Qu.:  578.5   1st Qu.: 227.5  
##                    Median : 1603.0   Median : 1101.0   Median : 428.0  
##                    Mean   : 2983.8   Mean   : 1998.0   Mean   : 768.2  
##                    3rd Qu.: 3780.5   3rd Qu.: 2525.5   3rd Qu.: 902.5  
##                    Max.   :21804.0   Max.   :18744.0   Max.   :6180.0  
##    Top10perc       Top25perc       F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   : 13.00   Min.   :  139   Min.   :    1.0  
##  1st Qu.:15.00   1st Qu.: 40.00   1st Qu.:  962   1st Qu.:  107.5  
##  Median :23.00   Median : 53.00   Median : 1688   Median :  370.0  
##  Mean   :26.87   Mean   : 54.82   Mean   : 3662   Mean   :  886.4  
##  3rd Qu.:34.00   3rd Qu.: 67.00   3rd Qu.: 4056   3rd Qu.:  987.5  
##  Max.   :95.00   Max.   :100.00   Max.   :28938   Max.   :21836.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2340   Min.   :1780   Min.   : 110.0   Min.   : 300  
##  1st Qu.: 7366   1st Qu.:3613   1st Qu.: 475.5   1st Qu.: 900  
##  Median : 9990   Median :4218   Median : 500.0   Median :1220  
##  Mean   :10399   Mean   :4377   Mean   : 548.6   Mean   :1357  
##  3rd Qu.:12838   3rd Qu.:5075   3rd Qu.: 600.0   3rd Qu.:1658  
##  Max.   :19960   Max.   :7400   Max.   :2340.0   Max.   :6800  
##       PhD            Terminal       S.F.Ratio      perc.alumni   
##  Min.   :  8.00   Min.   : 24.0   Min.   : 2.90   Min.   : 0.00  
##  1st Qu.: 62.00   1st Qu.: 70.5   1st Qu.:11.40   1st Qu.:13.00  
##  Median : 75.00   Median : 81.0   Median :13.50   Median :21.00  
##  Mean   : 72.36   Mean   : 79.3   Mean   :14.06   Mean   :22.61  
##  3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00  
##  Max.   :103.00   Max.   :100.0   Max.   :39.80   Max.   :63.00  
##      Expend        Grad.Rate  
##  Min.   : 3186   Min.   : 10  
##  1st Qu.: 6682   1st Qu.: 53  
##  Median : 8408   Median : 65  
##  Mean   : 9708   Mean   : 65  
##  3rd Qu.:10880   3rd Qu.: 78  
##  Max.   :56233   Max.   :118
summary(test)
##         Private         Apps           Accept          Enroll      
##  Private    :175   Min.   :  174   Min.   :  146   Min.   :  75.0  
##  Not Private: 58   1st Qu.:  805   1st Qu.:  638   1st Qu.: 274.0  
##                    Median : 1538   Median : 1141   Median : 452.0  
##                    Mean   : 2850   Mean   : 1963   Mean   : 791.3  
##                    3rd Qu.: 3540   3rd Qu.: 2140   3rd Qu.: 871.0  
##                    Max.   :16587   Max.   :13243   Max.   :6392.0  
##    Top10perc       Top25perc      F.Undergrad     P.Undergrad     
##  Min.   : 1.00   Min.   : 9.00   Min.   :  316   Min.   :    1.0  
##  1st Qu.:18.00   1st Qu.:45.00   1st Qu.: 1058   1st Qu.:   75.0  
##  Median :25.00   Median :57.00   Median : 1737   Median :  299.0  
##  Mean   :29.14   Mean   :57.98   Mean   : 3713   Mean   :  770.5  
##  3rd Qu.:37.00   3rd Qu.:71.00   3rd Qu.: 3796   3rd Qu.:  911.0  
##  Max.   :96.00   Max.   :99.00   Max.   :31643   Max.   :10221.0  
##     Outstate       Room.Board       Books           Personal   
##  Min.   : 2700   Min.   :2217   Min.   :  96.0   Min.   : 250  
##  1st Qu.: 7260   1st Qu.:3518   1st Qu.: 450.0   1st Qu.: 800  
##  Median : 9900   Median :4124   Median : 500.0   Median :1200  
##  Mean   :10552   Mean   :4311   Mean   : 550.6   Mean   :1300  
##  3rd Qu.:13240   3rd Qu.:5015   3rd Qu.: 600.0   3rd Qu.:1700  
##  Max.   :21700   Max.   :8124   Max.   :2000.0   Max.   :4913  
##       PhD            Terminal        S.F.Ratio      perc.alumni   
##  Min.   : 10.00   Min.   : 33.00   Min.   : 2.50   Min.   : 2.00  
##  1st Qu.: 63.00   1st Qu.: 71.00   1st Qu.:11.70   1st Qu.:13.00  
##  Median : 76.00   Median : 84.00   Median :13.70   Median :21.00  
##  Mean   : 73.27   Mean   : 80.58   Mean   :14.14   Mean   :23.06  
##  3rd Qu.: 86.00   3rd Qu.: 93.00   3rd Qu.:16.40   3rd Qu.:31.00  
##  Max.   :100.00   Max.   :100.00   Max.   :28.80   Max.   :64.00  
##      Expend        Grad.Rate    
##  Min.   : 3365   Min.   : 15.0  
##  1st Qu.: 6898   1st Qu.: 56.0  
##  Median : 8243   Median : 66.0  
##  Mean   : 9544   Mean   : 66.5  
##  3rd Qu.:10458   3rd Qu.: 77.0  
##  Max.   :33541   Max.   :100.0
dim(train)
## [1] 543  18
dim(test)
## [1] 233  18

Building Prediction Models

Model 1: Traditional Linear Regression

Traditional_lm_1 <- lm(Apps ~ ., data= train)
summary(Traditional_lm_1)
## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3379.6  -407.7   -55.2   345.2  6779.3 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.409e+03  4.377e+02  -3.218 0.001369 ** 
## PrivateNot Private  5.240e+02  1.629e+02   3.216 0.001379 ** 
## Accept              1.273e+00  5.973e-02  21.305  < 2e-16 ***
## Enroll             -2.995e-01  2.303e-01  -1.300 0.194053    
## Top10perc           5.114e+01  6.330e+00   8.079 4.53e-15 ***
## Top25perc          -1.267e+01  5.050e+00  -2.510 0.012385 *  
## F.Undergrad         8.808e-02  3.545e-02   2.484 0.013289 *  
## P.Undergrad         3.464e-02  3.508e-02   0.988 0.323845    
## Outstate           -5.346e-02  2.238e-02  -2.389 0.017236 *  
## Room.Board          1.922e-01  5.689e-02   3.379 0.000783 ***
## Books               2.132e-01  2.889e-01   0.738 0.460885    
## Personal           -2.472e-02  7.100e-02  -0.348 0.727882    
## PhD                -1.026e+01  5.321e+00  -1.929 0.054264 .  
## Terminal            6.484e-01  5.697e+00   0.114 0.909430    
## S.F.Ratio           1.443e+01  1.461e+01   0.988 0.323847    
## perc.alumni        -5.536e+00  4.723e+00  -1.172 0.241590    
## Expend              5.757e-02  1.307e-02   4.405 1.28e-05 ***
## Grad.Rate           9.018e+00  3.342e+00   2.698 0.007193 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1001 on 525 degrees of freedom
## Multiple R-squared:  0.9256, Adjusted R-squared:  0.9232 
## F-statistic: 384.2 on 17 and 525 DF,  p-value: < 2.2e-16
#R-squared and Adjusted R-squared are close to 1: about 92.32% of the variation in Apps is explained by the predictors.
#F-test: reject H0, so at least one predictor has a linear relationship with Apps.
#t-tests: we fail to reject H0 for Enroll, P.Undergrad, Books, Personal, Terminal,
#S.F.Ratio and perc.alumni, so these variables can be dropped from the model.
#The t-test for PhD is borderline (p = 0.054), so we keep it in the model.
#Delete the insignificant variables:
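Instead of dropping predictors by hand, AIC-based backward elimination with base R's step() is a common automated alternative. A minimal sketch, assuming the same train data frame as above (the name Traditional_lm_step is ours):

```r
# Hedged sketch: AIC-based backward elimination as an alternative to
# deleting insignificant variables manually. step() is in base R (stats).
Traditional_lm_step <- step(lm(Apps ~ ., data = train),
                            direction = "backward", trace = 0)
summary(Traditional_lm_step)
```

Note that step() minimizes AIC rather than applying t-test cutoffs, so it may keep a slightly different variable set than the manual reduction below.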

Traditional_lm_2 <- lm(Apps ~ .-Enroll-P.Undergrad-Books-Personal-Terminal-S.F.Ratio-perc.alumni,data= train)
summary(Traditional_lm_2)
## 
## Call:
## lm(formula = Apps ~ . - Enroll - P.Undergrad - Books - Personal - 
##     Terminal - S.F.Ratio - perc.alumni, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3395.7  -418.6   -27.8   338.3  6653.7 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.181e+03  2.995e+02  -3.942 9.17e-05 ***
## PrivateNot Private  5.833e+02  1.586e+02   3.679 0.000258 ***
## Accept              1.226e+00  4.440e-02  27.610  < 2e-16 ***
## Top10perc           4.893e+01  6.139e+00   7.971 9.63e-15 ***
## Top25perc          -1.221e+01  4.944e+00  -2.470 0.013842 *  
## F.Undergrad         6.411e-02  2.347e-02   2.731 0.006520 ** 
## Outstate           -5.938e-02  2.171e-02  -2.736 0.006431 ** 
## Room.Board          2.303e-01  5.388e-02   4.274 2.28e-05 ***
## PhD                -9.936e+00  3.638e+00  -2.731 0.006524 ** 
## Expend              5.218e-02  1.205e-02   4.330 1.79e-05 ***
## Grad.Rate           7.587e+00  3.208e+00   2.365 0.018384 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1001 on 532 degrees of freedom
## Multiple R-squared:  0.9247, Adjusted R-squared:  0.9233 
## F-statistic: 653.1 on 10 and 532 DF,  p-value: < 2.2e-16
#R-squared and Adjusted R-squared are close to 1: about 92.33% of the variation in Apps is explained by the predictors.
#F-test: reject H0, so at least one predictor has a linear relationship with Apps.
#t-tests: H0 is rejected for every remaining variable, so all are kept.
#Check the regression assumptions for model Traditional_lm_2.

#Normality of residuals
#1.Histogram
hist(Traditional_lm_2$residuals, probability = TRUE, main = "Histogram of residuals in Model 'Traditional_lm_2'", breaks = 25)
lines(density(Traditional_lm_2$residuals), col = "red")

#2.QQ-plot
qqnorm(Traditional_lm_2$residuals, main = "QQ Plot of residuals of Model 'Traditional_lm_2'", pch = 20)
qqline(Traditional_lm_2$residuals, col = "red")

#The tails deviate seriously from the 45-degree line, so the residual distribution is probably not normal.

#3.Test for Skewness and Kurtosis
jarque.test(Traditional_lm_2$residuals)
## 
##  Jarque-Bera Normality Test
## 
## data:  Traditional_lm_2$residuals
## JB = 2965.5, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(Traditional_lm_2$residuals)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  Traditional_lm_2$residuals
## kurt = 13.759, z = 10.557, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Note: the residuals are not normally distributed, so the normality assumption is violated!
#Check for influential observations with Cook's distance > 1
which(cooks.distance(Traditional_lm_2)>1)
## named integer(0)
#Diagnostic Plots
plot(Traditional_lm_2)

#Residuals vs Fitted shows a pattern, which suggests non-linearity or heteroscedasticity.
#The Q-Q plot deviates seriously from the 45-degree line; observations 251, 694 and 71 probably cause the non-normality.
#Scale-Location (square root of standardized residuals vs fitted values) also shows a pattern.
#Residuals vs Leverage flags outliers via Cook's distance, which measures each observation's influence on the model; values above 1 usually cause trouble and such points should be removed.
#Check multicollinearity
car::vif(Traditional_lm_2)
##     Private      Accept   Top10perc   Top25perc F.Undergrad    Outstate 
##    2.759054    5.697257    6.349038    5.181526    6.485246    4.047322 
##  Room.Board         PhD      Expend   Grad.Rate 
##    1.866515    1.891698    2.342980    1.684368
#All VIFs are moderate, so there is no multicollinearity problem.
#Conclusion: the normality assumption is severely violated.
#Bad model!
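The case numbers flagged in the diagnostic plots can also be listed programmatically. A sketch, assuming Traditional_lm_2 is fitted as above:

```r
# Largest standardized residuals and Cook's distances for Traditional_lm_2
# (rstandard() and cooks.distance() are base R stats functions).
std_res <- rstandard(Traditional_lm_2)
head(sort(abs(std_res), decreasing = TRUE), 5)
head(sort(cooks.distance(Traditional_lm_2), decreasing = TRUE), 5)
```

The top names returned here should match the observations labelled in plot(Traditional_lm_2).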

Test the Model Traditional_lm_2

#Prediction
pred_Tlm <- predict(Traditional_lm_2, test)
head(pred_Tlm ,5)
##         8        13        15        18        20 
## 2566.7247 1831.8215  677.0293 1088.8244 3032.5750
#Absolute error mean, median, sd, max, min-------
abs_err_Tlm <- abs(pred_Tlm - test$Apps)
mean(abs_err_Tlm)
## [1] 599.7741
median(abs_err_Tlm)
## [1] 375.7187
sd(abs_err_Tlm)
## [1] 734.9941
range(abs_err_Tlm)
## [1]    2.019535 6893.422430
#histogram and boxplot
hist(abs_err_Tlm, breaks = 15)

boxplot(abs_err_Tlm)

#Actual vs. Predicted
plot(test$Apps, pred_Tlm , xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
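The same absolute-error summaries are recomputed for every model below; a small helper keeps them in one place (abs_err_summary is our own name, not part of the original analysis):

```r
# Hypothetical helper: summarise absolute prediction errors in one call.
abs_err_summary <- function(pred, actual) {
  e <- abs(pred - actual)
  c(mean = mean(e), median = median(e), sd = sd(e),
    min = min(e), max = max(e))
}
```

For example, abs_err_summary(pred_Tlm, test$Apps) reproduces the statistics above.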

Model 2: Remove the outliers step by step and refit the model

#1-1-------------
#Remove cases 251, 694 and 71.

train2<- train[-which(rownames(train) %in% c(251,694,71)),]
dim(train2)
## [1] 540  18
dim(train)
## [1] 543  18
Traditional_lm_3 <- lm(Apps ~ ., data= train2)
summary(Traditional_lm_3)
## 
## Call:
## lm(formula = Apps ~ ., data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3600.0  -351.1   -22.6   267.8  5905.2 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.222e+03  3.842e+02  -3.180 0.001562 ** 
## PrivateNot Private  4.083e+02  1.428e+02   2.859 0.004426 ** 
## Accept              1.381e+00  5.288e-02  26.107  < 2e-16 ***
## Enroll             -6.699e-01  2.033e-01  -3.295 0.001052 ** 
## Top10perc           4.128e+01  5.598e+00   7.375 6.51e-13 ***
## Top25perc          -7.023e+00  4.439e+00  -1.582 0.114228    
## F.Undergrad         1.030e-01  3.100e-02   3.324 0.000950 ***
## P.Undergrad         5.393e-02  3.069e-02   1.758 0.079410 .  
## Outstate           -6.139e-02  1.959e-02  -3.133 0.001825 ** 
## Room.Board          1.848e-01  4.978e-02   3.713 0.000227 ***
## Books               3.067e-01  2.529e-01   1.213 0.225865    
## Personal           -3.492e-02  6.216e-02  -0.562 0.574568    
## PhD                -8.996e+00  4.649e+00  -1.935 0.053505 .  
## Terminal            1.374e+00  4.977e+00   0.276 0.782598    
## S.F.Ratio           1.078e+01  1.287e+01   0.838 0.402521    
## perc.alumni        -4.002e+00  4.138e+00  -0.967 0.333891    
## Expend              4.655e-02  1.165e-02   3.995 7.40e-05 ***
## Grad.Rate           6.109e+00  2.929e+00   2.086 0.037506 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 874.3 on 522 degrees of freedom
## Multiple R-squared:  0.9404, Adjusted R-squared:  0.9384 
## F-statistic: 484.4 on 17 and 522 DF,  p-value: < 2.2e-16
Traditional_lm_4 <- lm(Apps ~ .-Top25perc-Books-Personal-Terminal-S.F.Ratio-perc.alumni,data= train2)
summary(Traditional_lm_4)
## 
## Call:
## lm(formula = Apps ~ . - Top25perc - Books - Personal - Terminal - 
##     S.F.Ratio - perc.alumni, data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3610.0  -371.7   -25.3   270.8  6001.4 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.071e+03  2.522e+02  -4.247 2.56e-05 ***
## PrivateNot Private  4.346e+02  1.391e+02   3.123 0.001888 ** 
## Accept              1.388e+00  5.154e-02  26.940  < 2e-16 ***
## Enroll             -6.746e-01  1.999e-01  -3.374 0.000795 ***
## Top10perc           3.427e+01  3.357e+00  10.209  < 2e-16 ***
## F.Undergrad         1.014e-01  3.064e-02   3.309 0.001000 ***
## P.Undergrad         5.466e-02  3.044e-02   1.795 0.073174 .  
## Outstate           -6.948e-02  1.899e-02  -3.658 0.000280 ***
## Room.Board          2.060e-01  4.799e-02   4.292 2.11e-05 ***
## PhD                -9.016e+00  3.182e+00  -2.833 0.004787 ** 
## Expend              4.549e-02  1.055e-02   4.311 1.94e-05 ***
## Grad.Rate           5.069e+00  2.838e+00   1.786 0.074651 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 874.4 on 528 degrees of freedom
## Multiple R-squared:  0.9397, Adjusted R-squared:  0.9384 
## F-statistic: 747.8 on 11 and 528 DF,  p-value: < 2.2e-16
#Check Assumptions of Regression for Model Traditional_lm_4 

#Diagnostic Plots
plot(Traditional_lm_4)

#Test for Skewness and Kurtosis
jarque.test(Traditional_lm_4$residuals)
## 
##  Jarque-Bera Normality Test
## 
## data:  Traditional_lm_4$residuals
## JB = 3028, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(Traditional_lm_4$residuals)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  Traditional_lm_4$residuals
## kurt = 13.964, z = 10.586, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Note: the residuals are still not normally distributed, so the normality assumption remains violated!
#Outliers 175 and 606 probably cause the non-normality.
#1-2---------------------
#Remove cases 175 and 606.

train2<- train[-which(rownames(train) %in% c(251,694,71,175,606)),]
dim(train2)
## [1] 538  18
Traditional_lm_3 <- lm(Apps ~ ., data= train2)
summary(Traditional_lm_3)
## 
## Call:
## lm(formula = Apps ~ ., data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3513.6  -334.2   -33.8   239.9  4840.4 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.325e+03  3.516e+02  -3.770 0.000182 ***
## PrivateNot Private  3.769e+02  1.307e+02   2.884 0.004085 ** 
## Accept              1.387e+00  4.838e-02  28.674  < 2e-16 ***
## Enroll             -6.184e-01  1.867e-01  -3.313 0.000988 ***
## Top10perc           2.829e+01  5.300e+00   5.337 1.41e-07 ***
## Top25perc          -1.452e-01  4.121e+00  -0.035 0.971910    
## F.Undergrad         8.163e-02  2.859e-02   2.856 0.004467 ** 
## P.Undergrad         6.308e-02  2.808e-02   2.246 0.025124 *  
## Outstate           -5.991e-02  1.792e-02  -3.344 0.000887 ***
## Room.Board          1.564e-01  4.568e-02   3.424 0.000666 ***
## Books               3.399e-01  2.314e-01   1.469 0.142378    
## Personal           -4.073e-02  5.686e-02  -0.716 0.474047    
## PhD                -7.066e+00  4.258e+00  -1.659 0.097625 .  
## Terminal            1.197e+00  4.552e+00   0.263 0.792736    
## S.F.Ratio           1.203e+01  1.177e+01   1.022 0.307387    
## perc.alumni        -3.079e+00  3.788e+00  -0.813 0.416588    
## Expend              4.985e-02  1.070e-02   4.658 4.06e-06 ***
## Grad.Rate           5.799e+00  2.679e+00   2.164 0.030900 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 799.6 on 520 degrees of freedom
## Multiple R-squared:  0.9471, Adjusted R-squared:  0.9454 
## F-statistic: 547.8 on 17 and 520 DF,  p-value: < 2.2e-16
Traditional_lm_4 <- lm(Apps ~ .-Top25perc-Books-Personal-Terminal-S.F.Ratio-perc.alumni,data= train2)
summary(Traditional_lm_4)
## 
## Call:
## lm(formula = Apps ~ . - Top25perc - Books - Personal - Terminal - 
##     S.F.Ratio - perc.alumni, data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3532.8  -333.8   -41.0   221.8  4790.5 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.007e+03  2.305e+02  -4.372 1.49e-05 ***
## PrivateNot Private  4.116e+02  1.271e+02   3.239 0.001277 ** 
## Accept              1.397e+00  4.710e-02  29.659  < 2e-16 ***
## Enroll             -6.402e-01  1.833e-01  -3.493 0.000519 ***
## Top10perc           2.835e+01  3.129e+00   9.060  < 2e-16 ***
## F.Undergrad         8.344e-02  2.820e-02   2.959 0.003225 ** 
## P.Undergrad         6.314e-02  2.782e-02   2.270 0.023635 *  
## Outstate           -6.524e-02  1.735e-02  -3.760 0.000189 ***
## Room.Board          1.771e-01  4.399e-02   4.025 6.55e-05 ***
## PhD                -6.728e+00  2.916e+00  -2.307 0.021446 *  
## Expend              4.541e-02  9.677e-03   4.693 3.44e-06 ***
## Grad.Rate           5.228e+00  2.593e+00   2.017 0.044250 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 798.6 on 526 degrees of freedom
## Multiple R-squared:  0.9466, Adjusted R-squared:  0.9455 
## F-statistic: 848.3 on 11 and 526 DF,  p-value: < 2.2e-16
#Check Assumptions of Regression for Model Traditional_lm_4 

#Diagnostic Plots
plot(Traditional_lm_4)

#Test for Skewness and Kurtosis
jarque.test(Traditional_lm_4$residuals)
## 
##  Jarque-Bera Normality Test
## 
## data:  Traditional_lm_4$residuals
## JB = 2303.9, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(Traditional_lm_4$residuals)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  Traditional_lm_4$residuals
## kurt = 12.544, z = 10.165, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Note: the residuals are still not normally distributed, so the normality assumption remains violated!
#Outlier 562 probably causes the non-normality.
#1-3-----------
#Remove case 562.

train2<- train[-which(rownames(train) %in% c(251,694,71,175,606,562)),]
dim(train2)
## [1] 537  18
Traditional_lm_3 <- lm(Apps ~ ., data= train2)
summary(Traditional_lm_3)
## 
## Call:
## lm(formula = Apps ~ ., data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3369.2  -318.2   -35.7   215.0  4906.5 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.309e+03  3.408e+02  -3.839 0.000139 ***
## PrivateNot Private  3.638e+02  1.267e+02   2.871 0.004256 ** 
## Accept              1.356e+00  4.720e-02  28.727  < 2e-16 ***
## Enroll             -5.500e-01  1.813e-01  -3.033 0.002541 ** 
## Top10perc           2.569e+01  5.157e+00   4.981 8.64e-07 ***
## Top25perc           6.500e-04  3.995e+00   0.000 0.999870    
## F.Undergrad         8.456e-02  2.772e-02   3.051 0.002397 ** 
## P.Undergrad         6.219e-02  2.723e-02   2.284 0.022757 *  
## Outstate           -5.062e-02  1.744e-02  -2.902 0.003868 ** 
## Room.Board          1.525e-01  4.429e-02   3.444 0.000620 ***
## Books               3.201e-01  2.243e-01   1.427 0.154239    
## Personal           -2.998e-02  5.515e-02  -0.544 0.586909    
## PhD                -5.523e+00  4.136e+00  -1.335 0.182315    
## Terminal           -7.204e-01  4.425e+00  -0.163 0.870742    
## S.F.Ratio           1.137e+01  1.142e+01   0.996 0.319744    
## perc.alumni        -2.725e+00  3.672e+00  -0.742 0.458464    
## Expend              5.223e-02  1.038e-02   5.031 6.75e-07 ***
## Grad.Rate           5.411e+00  2.598e+00   2.083 0.037785 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 775.1 on 519 degrees of freedom
## Multiple R-squared:  0.9493, Adjusted R-squared:  0.9477 
## F-statistic: 571.8 on 17 and 519 DF,  p-value: < 2.2e-16
Traditional_lm_4 <- lm(Apps ~ .-Top25perc-Books-Personal-PhD-Terminal-S.F.Ratio-perc.alumni,data= train2)
summary(Traditional_lm_4)
## 
## Call:
## lm(formula = Apps ~ . - Top25perc - Books - Personal - PhD - 
##     Terminal - S.F.Ratio - perc.alumni, data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3372.4  -336.1   -19.9   227.8  4917.4 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.242e+03  2.028e+02  -6.127 1.76e-09 ***
## PrivateNot Private  3.002e+02  1.172e+02   2.562 0.010697 *  
## Accept              1.366e+00  4.612e-02  29.623  < 2e-16 ***
## Enroll             -5.665e-01  1.787e-01  -3.171 0.001609 ** 
## Top10perc           2.396e+01  2.951e+00   8.121 3.32e-15 ***
## F.Undergrad         8.323e-02  2.740e-02   3.038 0.002502 ** 
## P.Undergrad         5.776e-02  2.697e-02   2.142 0.032656 *  
## Outstate           -6.564e-02  1.643e-02  -3.996 7.37e-05 ***
## Room.Board          1.579e-01  4.256e-02   3.710 0.000229 ***
## Expend              4.698e-02  9.413e-03   4.991 8.19e-07 ***
## Grad.Rate           4.770e+00  2.522e+00   1.892 0.059102 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 776.8 on 526 degrees of freedom
## Multiple R-squared:  0.9484, Adjusted R-squared:  0.9474 
## F-statistic: 966.9 on 10 and 526 DF,  p-value: < 2.2e-16
#Check Assumptions of Regression for Model Traditional_lm_4 

#Diagnostic Plots
plot(Traditional_lm_4)

#Test for Skewness and Kurtosis
jarque.test(Traditional_lm_4$residuals)
## 
##  Jarque-Bera Normality Test
## 
## data:  Traditional_lm_4$residuals
## JB = 2357.4, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(Traditional_lm_4$residuals)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  Traditional_lm_4$residuals
## kurt = 12.686, z = 10.200, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Note: the residuals are still not normally distributed, so the normality assumption remains violated!
#Outlier 446 probably causes the non-normality.
#The residual distribution remains far from normal.
#1-4-----------
#Remove case 446.

train2<- train[-which(rownames(train) %in% c(251,694,71,175,606,562,446)),]
dim(train2)
## [1] 536  18
Traditional_lm_3 <- lm(Apps ~ ., data= train2)
summary(Traditional_lm_3)
## 
## Call:
## lm(formula = Apps ~ ., data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3230.4  -329.2   -27.1   232.2  4848.6 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.384e+03  3.317e+02  -4.173 3.52e-05 ***
## PrivateNot Private  3.981e+02  1.233e+02   3.228  0.00133 ** 
## Accept              1.326e+00  4.620e-02  28.709  < 2e-16 ***
## Enroll             -1.814e-01  1.883e-01  -0.963  0.33590    
## Top10perc           2.476e+01  5.017e+00   4.936 1.08e-06 ***
## Top25perc          -1.165e-01  3.885e+00  -0.030  0.97609    
## F.Undergrad         1.234e-02  2.991e-02   0.412  0.68018    
## P.Undergrad         8.012e-02  2.667e-02   3.004  0.00279 ** 
## Outstate           -5.595e-02  1.699e-02  -3.294  0.00106 ** 
## Room.Board          1.647e-01  4.312e-02   3.820  0.00015 ***
## Books               3.708e-01  2.183e-01   1.698  0.09003 .  
## Personal           -3.881e-02  5.365e-02  -0.724  0.46968    
## PhD                -3.055e+00  4.046e+00  -0.755  0.45049    
## Terminal           -2.169e+00  4.311e+00  -0.503  0.61503    
## S.F.Ratio           1.145e+01  1.110e+01   1.032  0.30253    
## perc.alumni        -3.717e+00  3.575e+00  -1.040  0.29894    
## Expend              5.458e-02  1.010e-02   5.402 1.00e-07 ***
## Grad.Rate           5.920e+00  2.528e+00   2.342  0.01958 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 753.7 on 518 degrees of freedom
## Multiple R-squared:   0.95,  Adjusted R-squared:  0.9483 
## F-statistic: 578.5 on 17 and 518 DF,  p-value: < 2.2e-16
Traditional_lm_4 <- lm(Apps ~ .-Enroll-Top25perc-F.Undergrad-Personal-PhD-Terminal-S.F.Ratio-perc.alumni,data= train2)
summary(Traditional_lm_4)
## 
## Call:
## lm(formula = Apps ~ . - Enroll - Top25perc - F.Undergrad - Personal - 
##     PhD - Terminal - S.F.Ratio - perc.alumni, data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3149.2  -335.6   -28.2   234.1  4735.1 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.533e+03  2.144e+02  -7.153 2.86e-12 ***
## PrivateNot Private  3.493e+02  1.120e+02   3.120  0.00191 ** 
## Accept              1.289e+00  1.897e-02  67.943  < 2e-16 ***
## Top10perc           2.192e+01  2.772e+00   7.909 1.54e-14 ***
## P.Undergrad         6.870e-02  2.419e-02   2.840  0.00468 ** 
## Outstate           -6.483e-02  1.572e-02  -4.123 4.34e-05 ***
## Room.Board          1.728e-01  4.116e-02   4.197 3.17e-05 ***
## Books               3.640e-01  2.116e-01   1.720  0.08596 .  
## Expend              4.882e-02  9.145e-03   5.338 1.40e-07 ***
## Grad.Rate           5.476e+00  2.444e+00   2.240  0.02548 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 753.7 on 526 degrees of freedom
## Multiple R-squared:  0.9492, Adjusted R-squared:  0.9483 
## F-statistic:  1092 on 9 and 526 DF,  p-value: < 2.2e-16
#Check Assumptions of Regression for Model Traditional_lm_4 

#Diagnostic Plots
plot(Traditional_lm_4)

#Test for Skewness and Kurtosis
jarque.test(Traditional_lm_4$residuals)
## 
##  Jarque-Bera Normality Test
## 
## data:  Traditional_lm_4$residuals
## JB = 2122.7, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(Traditional_lm_4$residuals)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  Traditional_lm_4$residuals
## kurt = 12.208, z = 10.042, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Note: even after removing these points one by one, the residuals are not normally distributed.
#The residual distribution is still far from normal, so step-by-step outlier removal does not fix the problem.

Training dataset without outliers

trimmed_train <- train[- which(train$Apps > tukey_1), ]
dim(trimmed_train)
## [1] 493  18
dim(train)
## [1] 543  18
summary(trimmed_train$Apps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      81     692    1321    2038    2925    7888
(dim(train)-dim(trimmed_train))/dim(train)*100
## [1] 9.208103 0.000000
#About 9% of the training observations are trimmed as outliers.
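The cutoff tukey_1 is computed in an earlier chunk not shown here. A sketch of the usual Tukey upper fence (Q3 + 1.5 * IQR) that such a cutoff is typically based on; tukey_upper_fence is our own name and an assumption about how tukey_1 was defined:

```r
# Sketch (assumption): Tukey upper fence, Q3 + 1.5 * IQR.
tukey_upper_fence <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  q[2] + 1.5 * (q[2] - q[1])
}
# e.g. tukey_1 <- tukey_upper_fence(train$Apps)
```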

Model 3: Building a model on trimmed_train (Traditional_lm_trimmed_train_2)

Traditional_lm_trimmed_train <- lm(Apps ~ ., data= trimmed_train)
summary(Traditional_lm_trimmed_train)
## 
## Call:
## lm(formula = Apps ~ ., data = trimmed_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1554.63  -261.13   -27.51   187.94  3005.16 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.318e+03  2.321e+02  -5.682 2.33e-08 ***
## PrivateNot Private  3.450e+02  8.795e+01   3.922 0.000101 ***
## Accept              1.427e+00  4.761e-02  29.969  < 2e-16 ***
## Enroll             -6.476e-01  1.610e-01  -4.023 6.68e-05 ***
## Top10perc           1.299e+01  3.688e+00   3.523 0.000469 ***
## Top25perc           1.836e+00  2.786e+00   0.659 0.510207    
## F.Undergrad         4.931e-02  2.470e-02   1.997 0.046441 *  
## P.Undergrad         5.667e-02  2.487e-02   2.279 0.023133 *  
## Outstate           -2.136e-02  1.201e-02  -1.779 0.075821 .  
## Room.Board          6.349e-02  3.023e-02   2.100 0.036214 *  
## Books               5.982e-01  1.483e-01   4.032 6.44e-05 ***
## Personal           -4.418e-02  3.714e-02  -1.190 0.234806    
## PhD                 5.138e-01  2.738e+00   0.188 0.851218    
## Terminal           -1.380e+00  2.904e+00  -0.475 0.634834    
## S.F.Ratio           2.100e+01  7.819e+00   2.686 0.007492 ** 
## perc.alumni         8.709e-01  2.469e+00   0.353 0.724439    
## Expend              2.413e-02  7.932e-03   3.042 0.002483 ** 
## Grad.Rate           3.077e+00  1.750e+00   1.758 0.079368 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 502.1 on 475 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9245 
## F-statistic: 355.4 on 17 and 475 DF,  p-value: < 2.2e-16
#Delete insignificant variables 

Traditional_lm_trimmed_train_2 <- lm(Apps ~ .-Top25perc-Personal-PhD-Terminal-perc.alumni,data=trimmed_train)
summary(Traditional_lm_trimmed_train_2)
## 
## Call:
## lm(formula = Apps ~ . - Top25perc - Personal - PhD - Terminal - 
##     perc.alumni, data = trimmed_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1553.41  -274.16   -21.81   177.43  3032.54 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.391e+03  2.022e+02  -6.880 1.88e-11 ***
## PrivateNot Private  3.362e+02  8.353e+01   4.024 6.64e-05 ***
## Accept              1.422e+00  4.651e-02  30.573  < 2e-16 ***
## Enroll             -6.425e-01  1.581e-01  -4.063 5.65e-05 ***
## Top10perc           1.502e+01  2.064e+00   7.276 1.41e-12 ***
## F.Undergrad         4.888e-02  2.447e-02   1.997  0.04636 *  
## P.Undergrad         5.027e-02  2.434e-02   2.065  0.03949 *  
## Outstate           -1.977e-02  1.138e-02  -1.737  0.08306 .  
## Room.Board          6.359e-02  2.934e-02   2.168  0.03069 *  
## Books               5.676e-01  1.437e-01   3.951 8.94e-05 ***
## S.F.Ratio           2.165e+01  7.682e+00   2.818  0.00503 ** 
## Expend              2.309e-02  7.772e-03   2.971  0.00312 ** 
## Grad.Rate           3.406e+00  1.698e+00   2.006  0.04541 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 500.7 on 480 degrees of freedom
## Multiple R-squared:  0.9268, Adjusted R-squared:  0.9249 
## F-statistic: 506.2 on 12 and 480 DF,  p-value: < 2.2e-16
#Check Assumptions of Regression for Model Traditional_lm_trimmed_train_2

#Normality of residuals
#1.Histogram
hist(Traditional_lm_trimmed_train_2$residuals, probability = TRUE, main = "Histogram of residuals in Model 'Traditional_lm_trimmed_train_2'", breaks = 25)
lines(density(Traditional_lm_trimmed_train_2$residuals), col = "red")

#QQ-plot
qqnorm(Traditional_lm_trimmed_train_2$residuals, main = "QQ Plot of residuals of Model 'Traditional_lm_trimmed_train_2'", pch = 20)
qqline(Traditional_lm_trimmed_train_2$residuals, col = "red")

#Test for Skewness and Kurtosis
jarque.test(Traditional_lm_trimmed_train_2$residuals)
## 
##  Jarque-Bera Normality Test
## 
## data:  Traditional_lm_trimmed_train_2$residuals
## JB = 1072.6, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(Traditional_lm_trimmed_train_2$residuals)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  Traditional_lm_trimmed_train_2$residuals
## kurt = 9.6074, z = 8.7112, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Note: the residuals are not normally distributed even after trimming, so the normality assumption is still violated!
#Check for influential observations with Cook's distance > 1
which(cooks.distance(Traditional_lm_trimmed_train_2)>1)
## named integer(0)
#Diagnostic Plots
plot(Traditional_lm_trimmed_train_2)

#Check multicollinearity
car::vif(Traditional_lm_trimmed_train_2) #Enroll and F.Undergrad have VIF > 10: multicollinearity problem.
##     Private      Accept      Enroll   Top10perc F.Undergrad P.Undergrad 
##    2.483746    6.455307   16.069740    2.024555   11.587167    1.832461 
##    Outstate  Room.Board       Books   S.F.Ratio      Expend   Grad.Rate 
##    3.952362    1.930230    1.067001    1.768344    2.480500    1.726186
#Conclusion: severe violation of the regression assumptions (non-normal residuals and multicollinearity).
#Bad model!
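With Enroll (VIF about 16) and F.Undergrad (about 11.6) above the usual cutoff of 10, one hedged follow-up is to drop the worst offender and re-check. A sketch, assuming the fitted model and the car package as above (lm_trimmed_no_enroll is our own name):

```r
# Sketch: drop the highest-VIF predictor (Enroll) and recompute VIFs.
lm_trimmed_no_enroll <- update(Traditional_lm_trimmed_train_2, . ~ . - Enroll)
car::vif(lm_trimmed_no_enroll)
```

Enroll and F.Undergrad both measure enrollment size, so removing one typically lowers the other's VIF substantially.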

Test the Model Traditional_lm_trimmed_train_2

#Prediction
pred_Tlm_trimmed <- predict(Traditional_lm_trimmed_train_2, test)
head(pred_Tlm_trimmed,5)
##         8        13        15        18        20 
## 2473.9270 1365.9194  475.2189 1000.1578 2897.7983
#Absolute error mean, median, sd, max, min-------
abs_err_Tlm_trimmed <- abs(pred_Tlm_trimmed - test$Apps)
mean(abs_err_Tlm_trimmed)
## [1] 511.9436
median(abs_err_Tlm_trimmed)
## [1] 253.6845
sd(abs_err_Tlm_trimmed)
## [1] 910.2706
range(abs_err_Tlm_trimmed)
## [1]    4.905429 9327.357739
#histogram and boxplot
hist(abs_err_Tlm_trimmed, breaks = 15)

boxplot(abs_err_Tlm_trimmed)

#Actual vs. Predicted
plot(test$Apps, pred_Tlm_trimmed , xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)

Box-Cox Transformation

library("MASS")
box_results <- boxcox(Apps ~ ., data = train, lambda = seq(-5, 5, 0.1))               

box_results <- data.frame(box_results$x, box_results$y) 
# Create a data frame with the results
lambda_1 <- box_results[which(box_results$box_results.y == max(box_results$box_results.y)), 1]
lambda_1
## [1] 0.5
#lambda differs from 0, so the transformation is y' = (y^lambda - 1) / lambda.
#Apply the transformation:
train$transform_Apps <-((train$Apps ^ lambda_1) - 1) / lambda_1
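Predictions from any model fitted on transform_Apps will be on the Box-Cox scale; since lambda_1 = 0.5 is nonzero, they can be mapped back with the inverse transformation. A sketch (inv_boxcox is our own helper name):

```r
# Inverse Box-Cox for lambda != 0: y = (lambda * y' + 1)^(1 / lambda).
inv_boxcox <- function(y_t, lambda) (lambda * y_t + 1)^(1 / lambda)
```

For example, inv_boxcox(predict(lm_transform_Apps_2, test), lambda_1) would give predictions back on the original Apps scale, so the absolute errors stay comparable with the earlier models.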

Model 4: Model with transform_Apps

lm_transform_Apps_1 <- lm(transform_Apps ~ . - Apps, data = train)
summary(lm_transform_Apps_1)
## 
## Call:
## lm(formula = transform_Apps ~ . - Apps, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -102.495   -9.039    0.343    8.187   71.446 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.931e+01  7.347e+00  -3.989 7.56e-05 ***
## PrivateNot Private  2.122e+01  2.735e+00   7.760 4.46e-14 ***
## Accept              1.533e-02  1.003e-03  15.288  < 2e-16 ***
## Enroll              4.950e-03  3.866e-03   1.280 0.201048    
## Top10perc           4.699e-01  1.063e-01   4.422 1.19e-05 ***
## Top25perc          -1.089e-02  8.476e-02  -0.128 0.897817    
## F.Undergrad        -1.356e-04  5.951e-04  -0.228 0.819907    
## P.Undergrad         1.136e-03  5.888e-04   1.928 0.054345 .  
## Outstate            1.981e-04  3.756e-04   0.527 0.598132    
## Room.Board          3.517e-03  9.550e-04   3.683 0.000254 ***
## Books               1.182e-02  4.850e-03   2.437 0.015145 *  
## Personal            1.390e-03  1.192e-03   1.166 0.243956    
## PhD                 1.001e-01  8.932e-02   1.120 0.263124    
## Terminal           -3.558e-02  9.563e-02  -0.372 0.710046    
## S.F.Ratio           1.097e+00  2.452e-01   4.472 9.50e-06 ***
## perc.alumni        -1.355e-01  7.927e-02  -1.709 0.088061 .  
## Expend              1.072e-03  2.194e-04   4.885 1.38e-06 ***
## Grad.Rate           2.485e-01  5.610e-02   4.430 1.15e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.8 on 525 degrees of freedom
## Multiple R-squared:  0.9075, Adjusted R-squared:  0.9045 
## F-statistic: 302.9 on 17 and 525 DF,  p-value: < 2.2e-16
#Delete insignificant variables 
lm_transform_Apps_2 <- lm(transform_Apps ~ . - Apps - Enroll - Top25perc - F.Undergrad -
                            Outstate - Personal - PhD - Terminal, data = train)
summary(lm_transform_Apps_2)
## 
## Call:
## lm(formula = transform_Apps ~ . - Apps - Enroll - Top25perc - 
##     F.Undergrad - Outstate - Personal - PhD - Terminal, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -106.362   -9.468    0.176    8.250   74.767 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.314e+01  6.320e+00  -3.662 0.000276 ***
## PrivateNot Private  2.219e+01  2.338e+00   9.495  < 2e-16 ***
## Accept              1.680e-02  4.197e-04  40.031  < 2e-16 ***
## Top10perc           5.105e-01  5.950e-02   8.579  < 2e-16 ***
## P.Undergrad         1.556e-03  5.373e-04   2.895 0.003946 ** 
## Room.Board          3.452e-03  8.276e-04   4.171 3.54e-05 ***
## Books               1.193e-02  4.683e-03   2.548 0.011115 *  
## S.F.Ratio           1.087e+00  2.429e-01   4.474 9.41e-06 ***
## perc.alumni        -1.155e-01  7.526e-02  -1.535 0.125476    
## Expend              1.128e-03  2.076e-04   5.436 8.32e-08 ***
## Grad.Rate           2.421e-01  5.511e-02   4.394 1.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.79 on 532 degrees of freedom
## Multiple R-squared:  0.9064, Adjusted R-squared:  0.9046 
## F-statistic: 515.2 on 10 and 532 DF,  p-value: < 2.2e-16
#Delete insignificant variables 

lm_transform_Apps_3 <- lm(transform_Apps ~ . - Apps - Enroll - Top25perc - F.Undergrad -
                            Outstate - Personal - PhD - Terminal - perc.alumni, data = train)
summary(lm_transform_Apps_3)
## 
## Call:
## lm(formula = transform_Apps ~ . - Apps - Enroll - Top25perc - 
##     F.Undergrad - Outstate - Personal - PhD - Terminal - perc.alumni, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.515   -9.250    0.223    8.351   74.224 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.503e+01  6.207e+00  -4.032 6.33e-05 ***
## PrivateNot Private  2.268e+01  2.319e+00   9.777  < 2e-16 ***
## Accept              1.689e-02  4.161e-04  40.598  < 2e-16 ***
## Top10perc           4.922e-01  5.837e-02   8.432 3.20e-16 ***
## P.Undergrad         1.579e-03  5.378e-04   2.937  0.00346 ** 
## Room.Board          3.546e-03  8.263e-04   4.291 2.11e-05 ***
## Books               1.218e-02  4.687e-03   2.599  0.00962 ** 
## S.F.Ratio           1.130e+00  2.416e-01   4.678 3.67e-06 ***
## Expend              1.100e-03  2.070e-04   5.314 1.58e-07 ***
## Grad.Rate           2.198e-01  5.322e-02   4.130 4.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.81 on 533 degrees of freedom
## Multiple R-squared:  0.906,  Adjusted R-squared:  0.9044 
## F-statistic: 570.7 on 9 and 533 DF,  p-value: < 2.2e-16
#Check Assumptions of Regression----
#Normality of residuals

#1.Histogram
hist(lm_transform_Apps_3$residuals, probability = TRUE, main = "Histogram of residuals in Model 'lm_transform_Apps_3'", breaks = 25)
lines(density(lm_transform_Apps_3$residuals), col = "red")

#2.QQ-plot
qqnorm(lm_transform_Apps_3$residuals, main = "QQ Plot of residuals of Model 'lm_transform_Apps_3'")
qqline(lm_transform_Apps_3$residuals, col = "red")

#The QQ plot shows heavy tails; check skewness and kurtosis formally

#3.Tests for skewness and kurtosis (jarque.test() and anscombe.test() are from the 'moments' package)
library("moments")
jarque.test(lm_transform_Apps_3$residuals)
## 
##  Jarque-Bera Normality Test
## 
## data:  lm_transform_Apps_3$residuals
## JB = 601.94, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(lm_transform_Apps_3$residuals)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  lm_transform_Apps_3$residuals
## kurt = 8.1514, z = 8.3056, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
plot(lm_transform_Apps_3)

car::vif(lm_transform_Apps_3)
##     Private      Accept   Top10perc P.Undergrad  Room.Board       Books 
##    2.091806    1.773306    2.034090    1.427575    1.555793    1.045636 
##   S.F.Ratio      Expend   Grad.Rate 
##    1.804452    2.449353    1.642820
#no multicollinearity problem.
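All VIFs above sit well below the common rule-of-thumb cutoffs of 5 or 10. As a reminder of what car::vif() computes, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors; a self-contained toy sketch:

```r
# VIF by hand on two correlated toy predictors
set.seed(42)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.5)        # strongly correlated with x1
r2 <- summary(lm(x2 ~ x1))$r.squared   # R^2 of x2 regressed on the other predictors (here: x1)
1 / (1 - r2)                           # VIF of x2; roughly 5 here, flagging collinearity
```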

#Note: the residuals are not normally distributed, so the normality assumption of the regression is violated!
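The Anscombe-Glynn test above reports sample kurtosis near 8, against the Gaussian value of 3, i.e. heavy tails. The statistic itself is easy to reproduce by hand (sketch; `kurt` is our name, not from the analysis):

```r
# Sample kurtosis: fourth central moment over squared variance; equals 3 for a normal
kurt <- function(x) mean((x - mean(x))^4) / mean((x - mean(x))^2)^2
set.seed(1)
kurt(rnorm(1e5))  # close to 3 for a large Gaussian sample
```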

Test the Model Prediction: lm_transform_Apps_3

pred_lm_transform <- predict(lm_transform_Apps_3, test)
pred_lm_transform<-((lambda_1*pred_lm_transform)+1)^(1/lambda_1)
head(pred_lm_transform,5)
##         8        13        15        18        20 
## 2072.3496 1375.0491  623.1701  804.3528 2728.4601
#Absolute error mean, median, sd, max, min-------
abs_err_lm_transform <- abs(pred_lm_transform - test$Apps)
mean(abs_err_lm_transform)
## [1] 711.1906
median(abs_err_lm_transform)
## [1] 370.7594
sd(abs_err_lm_transform)
## [1] 1116.981
range(abs_err_lm_transform)
## [1]    0.5630465 7773.2584894
#histogram and boxplot
hist(abs_err_lm_transform, breaks = 15)

boxplot(abs_err_lm_transform)

#Actual vs. Predicted
plot(test$Apps, pred_lm_transform, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)

It seems that the outliers have a strong effect on the model, so we removed the outliers from the “Test” data set.

trimmed_test<-test[-which(test$Apps > tukey_1),]
dim(test)
## [1] 233  18
dim(trimmed_test)
## [1] 214  18
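Here tukey_1 is the Tukey upper fence computed earlier in the analysis; assuming the usual Q3 + 1.5 * IQR form (our assumption), it can be reproduced with a small helper:

```r
# Tukey upper fence: values above Q3 + k * IQR are flagged as outliers (k = 1.5 by convention)
upper_fence <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  q[2] + k * (q[2] - q[1])
}
upper_fence(c(1, 2, 3, 4, 100))  # 7: only the value 100 lies above the fence
```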
#Model Prediction: lm_transform_Apps_3 with trimmed_test data--------
pred_lm_transform_trimmed_test <- predict(lm_transform_Apps_3, trimmed_test)
pred_lm_transform_trimmed_test<-((lambda_1*pred_lm_transform_trimmed_test)+1)^(1/lambda_1)
head(pred_lm_transform_trimmed_test,5)
##         8        13        15        18        20 
## 2072.3496 1375.0491  623.1701  804.3528 2728.4601
#Absolute error mean, median, sd, max, min with trimmed_test data-------
abs_err_lm_transform_trimmed_test <- abs(pred_lm_transform_trimmed_test - trimmed_test$Apps)
mean(abs_err_lm_transform_trimmed_test)
## [1] 498.4068
median(abs_err_lm_transform_trimmed_test)
## [1] 326.481
sd(abs_err_lm_transform_trimmed_test)
## [1] 567.9842
range(abs_err_lm_transform_trimmed_test)
## [1]    0.5630465 4022.9916692
#histogram and boxplot
hist(abs_err_lm_transform_trimmed_test, breaks = 15)

boxplot(abs_err_lm_transform_trimmed_test)

#Actual vs. Predicted
plot(trimmed_test$Apps,pred_lm_transform_trimmed_test, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)

Model5: Using the Best Subset Selection Method (according to adjr2)

library("leaps")
dim(train)
## [1] 543  19
#Best Subset Selection---------------------------
bestsub_1 <- regsubsets(transform_Apps~ . - Apps, nvmax = 17, data = train, method = "exhaustive")
summary(bestsub_1)
## Subset selection object
## Call: regsubsets.formula(transform_Apps ~ . - Apps, nvmax = 17, data = train, 
##     method = "exhaustive")
## 17 Variables  (and intercept)
##                    Forced in Forced out
## PrivateNot Private     FALSE      FALSE
## Accept                 FALSE      FALSE
## Enroll                 FALSE      FALSE
## Top10perc              FALSE      FALSE
## Top25perc              FALSE      FALSE
## F.Undergrad            FALSE      FALSE
## P.Undergrad            FALSE      FALSE
## Outstate               FALSE      FALSE
## Room.Board             FALSE      FALSE
## Books                  FALSE      FALSE
## Personal               FALSE      FALSE
## PhD                    FALSE      FALSE
## Terminal               FALSE      FALSE
## S.F.Ratio              FALSE      FALSE
## perc.alumni            FALSE      FALSE
## Expend                 FALSE      FALSE
## Grad.Rate              FALSE      FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: exhaustive
##           PrivateNot Private Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 )  " "                "*"    " "    " "       " "       " "        
## 2  ( 1 )  " "                "*"    " "    "*"       " "       " "        
## 3  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 4  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 5  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 6  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 7  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 8  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 9  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 10  ( 1 ) "*"                "*"    " "    "*"       " "       " "        
## 11  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 12  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 13  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 14  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 15  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 16  ( 1 ) "*"                "*"    "*"    "*"       " "       "*"        
## 17  ( 1 ) "*"                "*"    "*"    "*"       "*"       "*"        
##           P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 )  " "         " "      " "        " "   " "      " " " "      " "      
## 2  ( 1 )  " "         " "      " "        " "   " "      " " " "      " "      
## 3  ( 1 )  " "         " "      " "        " "   " "      " " " "      " "      
## 4  ( 1 )  " "         " "      "*"        " "   " "      " " " "      " "      
## 5  ( 1 )  " "         " "      "*"        " "   " "      " " " "      " "      
## 6  ( 1 )  " "         " "      "*"        " "   " "      " " " "      "*"      
## 7  ( 1 )  " "         " "      "*"        " "   " "      " " " "      "*"      
## 8  ( 1 )  "*"         " "      "*"        " "   " "      " " " "      "*"      
## 9  ( 1 )  "*"         " "      "*"        "*"   " "      " " " "      "*"      
## 10  ( 1 ) "*"         " "      "*"        "*"   " "      " " " "      "*"      
## 11  ( 1 ) "*"         " "      "*"        "*"   " "      " " " "      "*"      
## 12  ( 1 ) "*"         " "      "*"        "*"   " "      "*" " "      "*"      
## 13  ( 1 ) "*"         " "      "*"        "*"   "*"      "*" " "      "*"      
## 14  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" " "      "*"      
## 15  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" "*"      "*"      
## 16  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" "*"      "*"      
## 17  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" "*"      "*"      
##           perc.alumni Expend Grad.Rate
## 1  ( 1 )  " "         " "    " "      
## 2  ( 1 )  " "         " "    " "      
## 3  ( 1 )  " "         " "    " "      
## 4  ( 1 )  " "         " "    " "      
## 5  ( 1 )  " "         "*"    " "      
## 6  ( 1 )  " "         "*"    " "      
## 7  ( 1 )  " "         "*"    "*"      
## 8  ( 1 )  " "         "*"    "*"      
## 9  ( 1 )  " "         "*"    "*"      
## 10  ( 1 ) "*"         "*"    "*"      
## 11  ( 1 ) "*"         "*"    "*"      
## 12  ( 1 ) "*"         "*"    "*"      
## 13  ( 1 ) "*"         "*"    "*"      
## 14  ( 1 ) "*"         "*"    "*"      
## 15  ( 1 ) "*"         "*"    "*"      
## 16  ( 1 ) "*"         "*"    "*"      
## 17  ( 1 ) "*"         "*"    "*"
#Model Selection
#1.Adjusted R-squared
summary(bestsub_1)$rsq
##  [1] 0.8317380 0.8699219 0.8860704 0.8946646 0.8971094 0.9010831 0.9031614
##  [8] 0.9047911 0.9059823 0.9063966 0.9068195 0.9071595 0.9073828 0.9074286
## [15] 0.9074571 0.9074673 0.9074702
summary(bestsub_1)$adjr2
##  [1] 0.8314270 0.8694402 0.8854363 0.8938815 0.8961514 0.8999758 0.9018944
##  [8] 0.9033647 0.9043947 0.9046372 0.9048892 0.9050575 0.9051067 0.9049741
## [15] 0.9048230 0.9046526 0.9044740
#Plot Adjusted R-squared
plot(summary(bestsub_1)$adjr2,
     type = "b",
     xlab = "# of Variables", 
     ylab = "AdjR2", 
     xaxt = 'n',
     xlim = c(1, 17)); grid()
axis(1, at = 1:17, labels = 1:17)

points(which.max(summary(bestsub_1)$adjr2), 
       summary(bestsub_1)$adjr2[which.max(summary(bestsub_1)$adjr2)],
       col = "red", cex = 2, pch = 20)

#Considering Adjusted R-squared, the best model is the one with 13 variables.

#Coefficients of the best model---

coef(bestsub_1, 13) #Model w/ 13 variables
##        (Intercept) PrivateNot Private             Accept             Enroll 
##      -29.566390023       20.444832553        0.015437816        0.004064957 
##          Top10perc        P.Undergrad         Room.Board              Books 
##        0.463501309        0.001106348        0.003623007        0.011364203 
##           Personal                PhD          S.F.Ratio        perc.alumni 
##        0.001334815        0.081410586        1.083955541       -0.131028448 
##             Expend          Grad.Rate 
##        0.001101581        0.252814547
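Transcribing the 13 selected names into an lm() call by hand invites typos; base R's reformulate() builds the formula from a character vector instead. A toy sketch (note that regsubsets reports dummy names like "PrivateNot Private", which must be mapped back to the factor "Private" by hand):

```r
# Build a formula programmatically from selected variable names
set.seed(10)
d   <- data.frame(y = rnorm(20), a = rnorm(20), b = rnorm(20), cc = rnorm(20))
sel <- c("a", "cc")                        # pretend these were the selected variables
fit <- lm(reformulate(sel, response = "y"), data = d)
names(coef(fit))                           # "(Intercept)" "a" "cc"
```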
bestsub_adjr2 <- lm(transform_Apps ~ Private + Accept + Enroll + Top10perc + P.Undergrad +
                      Room.Board + Books + Personal + PhD + S.F.Ratio + perc.alumni +
                      Expend + Grad.Rate, data = train)
summary(bestsub_adjr2)
## 
## Call:
## lm(formula = transform_Apps ~ Private + Accept + Enroll + Top10perc + 
##     P.Undergrad + Room.Board + Books + Personal + PhD + S.F.Ratio + 
##     perc.alumni + Expend + Grad.Rate, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -102.266   -9.208    0.192    8.372   71.481 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.957e+01  6.996e+00  -4.226 2.80e-05 ***
## PrivateNot Private  2.044e+01  2.461e+00   8.308 8.23e-16 ***
## Accept              1.544e-02  9.681e-04  15.946  < 2e-16 ***
## Enroll              4.065e-03  2.717e-03   1.496   0.1352    
## Top10perc           4.635e-01  6.400e-02   7.242 1.57e-12 ***
## P.Undergrad         1.106e-03  5.733e-04   1.930   0.0542 .  
## Room.Board          3.623e-03  8.812e-04   4.112 4.56e-05 ***
## Books               1.136e-02  4.762e-03   2.387   0.0174 *  
## Personal            1.335e-03  1.182e-03   1.129   0.2593    
## PhD                 8.141e-02  5.981e-02   1.361   0.1741    
## S.F.Ratio           1.084e+00  2.432e-01   4.457 1.01e-05 ***
## perc.alumni        -1.310e-01  7.681e-02  -1.706   0.0886 .  
## Expend              1.102e-03  2.088e-04   5.275 1.94e-07 ***
## Grad.Rate           2.528e-01  5.539e-02   4.564 6.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.75 on 529 degrees of freedom
## Multiple R-squared:  0.9074, Adjusted R-squared:  0.9051 
## F-statistic: 398.7 on 13 and 529 DF,  p-value: < 2.2e-16

Test the Model according to adjr2

#Prediction
pred_bestsub_adj2  <- predict(bestsub_adjr2, test)
head(pred_bestsub_adj2,5) 
##         8        13        15        18        20 
##  88.24948  71.46674  46.18745  54.67089 102.34381
pred_bestsub_adj2  <-((lambda_1*pred_bestsub_adj2)+1)^(1/lambda_1)
head(pred_bestsub_adj2,5)
##         8        13        15        18        20 
## 2036.2422 1349.3405  580.5075  802.8974 2721.9076
#Absolute error mean, median, sd, max, min-------
abs_err_bestsub_adj2 <- abs(pred_bestsub_adj2 - test$Apps)
mean(abs_err_bestsub_adj2)
## [1] 722.8981
median(abs_err_bestsub_adj2)
## [1] 378.1958
sd(abs_err_bestsub_adj2)
## [1] 1173.394
range(abs_err_bestsub_adj2)
## [1]    2.969436 7858.989501
#histogram and boxplot
hist(abs_err_bestsub_adj2, breaks = 15)

boxplot(abs_err_bestsub_adj2)

#Actual vs. Predicted
plot(test$Apps, pred_bestsub_adj2, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)

Model6: Using the Best Subset Selection Method (according to Cp)

#Plot Cp
plot(summary(bestsub_1)$cp,
     type = "b",
     xlab = "# of Variables", 
     ylab = "Cp", 
     xaxt = 'n',
     xlim = c(1, 17)); grid()
axis(1, at = 1:17, labels = 1:17)

points(which.min(summary(bestsub_1)$cp), 
       summary(bestsub_1)$cp[which.min(summary(bestsub_1)$cp)],
       col = "red", cex = 2, pch = 20)

#Considering Cp, the best model is the one with 11 variables.
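Mallows' Cp compares a submodel's RSS to the full model's error variance: Cp = RSS_p / s2_full - n + 2p, with p counting the coefficients including the intercept; an adequate submodel scores near p. A self-contained toy sketch:

```r
# Mallows' Cp by hand on simulated data
set.seed(7)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)                # x2 is irrelevant
full <- lm(y ~ x1 + x2)
sub  <- lm(y ~ x1)
s2   <- summary(full)$sigma^2              # error variance estimate from the full model
cp   <- sum(resid(sub)^2) / s2 - n + 2 * length(coef(sub))
cp                                         # close to p = 2 when the dropped predictor is pure noise
```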

#Coefficients of the best model---

coef(bestsub_1, 11) #Model w/ 11 variables
##        (Intercept) PrivateNot Private             Accept             Enroll 
##      -24.293342083       21.502050003        0.015453496        0.004206017 
##          Top10perc        P.Undergrad         Room.Board              Books 
##        0.489014606        0.001277759        0.003781491        0.011561276 
##          S.F.Ratio        perc.alumni             Expend          Grad.Rate 
##        1.079085351       -0.127580465        0.001143855        0.251775079
bestsub_CP <- lm(transform_Apps ~ Private + Accept + Enroll + Top10perc + P.Undergrad +
                   Room.Board + Books + S.F.Ratio + perc.alumni + Expend + Grad.Rate,
                 data = train)
summary(bestsub_CP)
## 
## Call:
## lm(formula = transform_Apps ~ Private + Accept + Enroll + Top10perc + 
##     P.Undergrad + Room.Board + Books + S.F.Ratio + perc.alumni + 
##     Expend + Grad.Rate, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.742   -9.390    0.207    8.257   71.342 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.429e+01  6.355e+00  -3.823 0.000148 ***
## PrivateNot Private  2.150e+01  2.377e+00   9.047  < 2e-16 ***
## Accept              1.545e-02  9.650e-04  16.014  < 2e-16 ***
## Enroll              4.206e-03  2.709e-03   1.552 0.121163    
## Top10perc           4.890e-01  6.101e-02   8.016 7.01e-15 ***
## P.Undergrad         1.278e-03  5.657e-04   2.259 0.024297 *  
## Room.Board          3.782e-03  8.533e-04   4.432 1.14e-05 ***
## Books               1.156e-02  4.683e-03   2.469 0.013879 *  
## S.F.Ratio           1.079e+00  2.427e-01   4.447 1.06e-05 ***
## perc.alumni        -1.276e-01  7.557e-02  -1.688 0.091932 .  
## Expend              1.144e-03  2.075e-04   5.511 5.56e-08 ***
## Grad.Rate           2.518e-01  5.539e-02   4.546 6.79e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.77 on 531 degrees of freedom
## Multiple R-squared:  0.9068, Adjusted R-squared:  0.9049 
## F-statistic: 469.8 on 11 and 531 DF,  p-value: < 2.2e-16

Test the Model according to Cp

#Prediction
pred_bestsub_CP  <- predict(bestsub_CP, test)
head(pred_bestsub_CP,5)
##         8        13        15        18        20 
##  87.51193  72.79531  47.11327  56.10697 102.87272
pred_bestsub_CP  <-((lambda_1*pred_bestsub_CP)+1)^(1/lambda_1)
head(pred_bestsub_CP,5)
##         8        13        15        18        20 
## 2003.0964 1398.5845  603.0283  844.1050 2749.5720
#Absolute error mean, median, sd, max, min-------
abs_err_bestsub_CP <- abs(pred_bestsub_CP - test$Apps)
mean(abs_err_bestsub_CP)
## [1] 724.3431
median(abs_err_bestsub_CP)
## [1] 363.9783
sd(abs_err_bestsub_CP)
## [1] 1174.198
range(abs_err_bestsub_CP)
## [1]    0.9072678 7946.2796208
#histogram and boxplot
hist(abs_err_bestsub_CP, breaks = 15)

boxplot(abs_err_bestsub_CP)

#Actual vs. Predicted
plot(test$Apps, pred_bestsub_CP, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)

Model7: Using the Best Subset Selection Method (according to BIC)

#Plot BIC
plot(summary(bestsub_1)$bic,
     type = "b",
     xlab = "# of Variables", 
     ylab = "BIC", 
     xaxt = 'n',
     xlim = c(1, 17)); grid()
axis(1, at = 1:17, labels = 1:17)

points(which.min(summary(bestsub_1)$bic), 
       summary(bestsub_1)$bic[which.min(summary(bestsub_1)$bic)],
       col = "red", cex = 2, pch = 20)

#Considering BIC, the best model is the one with 9 variables.
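For ordinary lm fits, base R's BIC() implements the same smaller-is-better criterion (it differs from the leaps values only by an additive constant at fixed n), which is handy for spot checks; toy sketch:

```r
# BIC from base R on nested toy models
set.seed(3)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)
b_small <- BIC(lm(y ~ x1))       # true model
b_large <- BIC(lm(y ~ x1 + x2))  # adds an irrelevant term, which BIC usually penalizes
c(b_small, b_large)
```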
coef(bestsub_1, 9) #Model w/ 9 variables
##        (Intercept) PrivateNot Private             Accept          Top10perc 
##      -25.027568709       22.676676201        0.016893239        0.492189120 
##        P.Undergrad         Room.Board              Books          S.F.Ratio 
##        0.001579532        0.003545740        0.012178874        1.130147449 
##             Expend          Grad.Rate 
##        0.001100015        0.219789893
bestsub_BIC <- lm(transform_Apps ~ Private + Accept + Top10perc + P.Undergrad + Room.Board +
                    Books + S.F.Ratio + Expend + Grad.Rate, data = train)
summary(bestsub_BIC)
## 
## Call:
## lm(formula = transform_Apps ~ Private + Accept + Top10perc + 
##     P.Undergrad + Room.Board + Books + S.F.Ratio + Expend + Grad.Rate, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.515   -9.250    0.223    8.351   74.224 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.503e+01  6.207e+00  -4.032 6.33e-05 ***
## PrivateNot Private  2.268e+01  2.319e+00   9.777  < 2e-16 ***
## Accept              1.689e-02  4.161e-04  40.598  < 2e-16 ***
## Top10perc           4.922e-01  5.837e-02   8.432 3.20e-16 ***
## P.Undergrad         1.579e-03  5.378e-04   2.937  0.00346 ** 
## Room.Board          3.546e-03  8.263e-04   4.291 2.11e-05 ***
## Books               1.218e-02  4.687e-03   2.599  0.00962 ** 
## S.F.Ratio           1.130e+00  2.416e-01   4.678 3.67e-06 ***
## Expend              1.100e-03  2.070e-04   5.314 1.58e-07 ***
## Grad.Rate           2.198e-01  5.322e-02   4.130 4.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.81 on 533 degrees of freedom
## Multiple R-squared:  0.906,  Adjusted R-squared:  0.9044 
## F-statistic: 570.7 on 9 and 533 DF,  p-value: < 2.2e-16

Test the Model according to BIC

#Prediction
pred_bestsub_BIC  <- predict(bestsub_BIC, test)
head(pred_bestsub_BIC,5) 
##         8        13        15        18        20 
##  89.04613  72.16331  47.92675  54.72223 102.46933
pred_bestsub_BIC  <-((lambda_1*pred_bestsub_BIC)+1)^(1/lambda_1)
head(pred_bestsub_BIC,5)
##         8        13        15        18        20 
## 2072.3496 1375.0491  623.1701  804.3528 2728.4601
#Absolute error mean, median, sd, max, min-------
abs_err_bestsub_BIC <- abs(pred_bestsub_BIC - test$Apps)
mean(abs_err_bestsub_BIC)
## [1] 711.1906
median(abs_err_bestsub_BIC)
## [1] 370.7594
sd(abs_err_bestsub_BIC)
## [1] 1116.981
range(abs_err_bestsub_BIC)
## [1]    0.5630465 7773.2584894
#histogram and boxplot
hist(abs_err_bestsub_BIC, breaks = 15)

boxplot(abs_err_bestsub_BIC)

#Actual vs. Predicted
plot(test$Apps, pred_bestsub_BIC, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)

Model8: Using Forward and Backward Stepwise Selection

fwd_1 <- regsubsets(transform_Apps ~ . - Apps, nvmax = 17, data = train, method = "forward")

summary(fwd_1)
## Subset selection object
## Call: regsubsets.formula(transform_Apps ~ . - Apps, nvmax = 17, data = train, 
##     method = "forward")
## 17 Variables  (and intercept)
##                    Forced in Forced out
## PrivateNot Private     FALSE      FALSE
## Accept                 FALSE      FALSE
## Enroll                 FALSE      FALSE
## Top10perc              FALSE      FALSE
## Top25perc              FALSE      FALSE
## F.Undergrad            FALSE      FALSE
## P.Undergrad            FALSE      FALSE
## Outstate               FALSE      FALSE
## Room.Board             FALSE      FALSE
## Books                  FALSE      FALSE
## Personal               FALSE      FALSE
## PhD                    FALSE      FALSE
## Terminal               FALSE      FALSE
## S.F.Ratio              FALSE      FALSE
## perc.alumni            FALSE      FALSE
## Expend                 FALSE      FALSE
## Grad.Rate              FALSE      FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: forward
##           PrivateNot Private Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 )  " "                "*"    " "    " "       " "       " "        
## 2  ( 1 )  " "                "*"    " "    "*"       " "       " "        
## 3  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 4  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 5  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 6  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 7  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 8  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 9  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 10  ( 1 ) "*"                "*"    " "    "*"       " "       " "        
## 11  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 12  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 13  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 14  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 15  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 16  ( 1 ) "*"                "*"    "*"    "*"       " "       "*"        
## 17  ( 1 ) "*"                "*"    "*"    "*"       "*"       "*"        
##           P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 )  " "         " "      " "        " "   " "      " " " "      " "      
## 2  ( 1 )  " "         " "      " "        " "   " "      " " " "      " "      
## 3  ( 1 )  " "         " "      " "        " "   " "      " " " "      " "      
## 4  ( 1 )  " "         " "      "*"        " "   " "      " " " "      " "      
## 5  ( 1 )  " "         " "      "*"        " "   " "      " " " "      " "      
## 6  ( 1 )  " "         " "      "*"        " "   " "      " " " "      "*"      
## 7  ( 1 )  " "         " "      "*"        " "   " "      " " " "      "*"      
## 8  ( 1 )  "*"         " "      "*"        " "   " "      " " " "      "*"      
## 9  ( 1 )  "*"         " "      "*"        "*"   " "      " " " "      "*"      
## 10  ( 1 ) "*"         " "      "*"        "*"   " "      " " " "      "*"      
## 11  ( 1 ) "*"         " "      "*"        "*"   " "      " " " "      "*"      
## 12  ( 1 ) "*"         " "      "*"        "*"   " "      "*" " "      "*"      
## 13  ( 1 ) "*"         " "      "*"        "*"   "*"      "*" " "      "*"      
## 14  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" " "      "*"      
## 15  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" "*"      "*"      
## 16  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" "*"      "*"      
## 17  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" "*"      "*"      
##           perc.alumni Expend Grad.Rate
## 1  ( 1 )  " "         " "    " "      
## 2  ( 1 )  " "         " "    " "      
## 3  ( 1 )  " "         " "    " "      
## 4  ( 1 )  " "         " "    " "      
## 5  ( 1 )  " "         "*"    " "      
## 6  ( 1 )  " "         "*"    " "      
## 7  ( 1 )  " "         "*"    "*"      
## 8  ( 1 )  " "         "*"    "*"      
## 9  ( 1 )  " "         "*"    "*"      
## 10  ( 1 ) "*"         "*"    "*"      
## 11  ( 1 ) "*"         "*"    "*"      
## 12  ( 1 ) "*"         "*"    "*"      
## 13  ( 1 ) "*"         "*"    "*"      
## 14  ( 1 ) "*"         "*"    "*"      
## 15  ( 1 ) "*"         "*"    "*"      
## 16  ( 1 ) "*"         "*"    "*"      
## 17  ( 1 ) "*"         "*"    "*"
which.max(summary(fwd_1)$adjr2)
## [1] 13
which.min(summary(fwd_1)$cp)
## [1] 11
which.min(summary(fwd_1)$bic)
## [1] 9
# Backward Stepwise Selection----------
bwd_1 <- regsubsets(transform_Apps ~ . - Apps, nvmax = 17, data = train, method = "backward")
summary(bwd_1)
## Subset selection object
## Call: regsubsets.formula(transform_Apps ~ . - Apps, nvmax = 17, data = train, 
##     method = "backward")
## 17 Variables  (and intercept)
##                    Forced in Forced out
## PrivateNot Private     FALSE      FALSE
## Accept                 FALSE      FALSE
## Enroll                 FALSE      FALSE
## Top10perc              FALSE      FALSE
## Top25perc              FALSE      FALSE
## F.Undergrad            FALSE      FALSE
## P.Undergrad            FALSE      FALSE
## Outstate               FALSE      FALSE
## Room.Board             FALSE      FALSE
## Books                  FALSE      FALSE
## Personal               FALSE      FALSE
## PhD                    FALSE      FALSE
## Terminal               FALSE      FALSE
## S.F.Ratio              FALSE      FALSE
## perc.alumni            FALSE      FALSE
## Expend                 FALSE      FALSE
## Grad.Rate              FALSE      FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: backward
##           PrivateNot Private Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 )  " "                "*"    " "    " "       " "       " "        
## 2  ( 1 )  " "                "*"    " "    "*"       " "       " "        
## 3  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 4  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 5  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 6  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 7  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 8  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 9  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 10  ( 1 ) "*"                "*"    " "    "*"       " "       " "        
## 11  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 12  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 13  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 14  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 15  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 16  ( 1 ) "*"                "*"    "*"    "*"       " "       "*"        
## 17  ( 1 ) "*"                "*"    "*"    "*"       "*"       "*"        
##           P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 )  " "         " "      " "        " "   " "      " " " "      " "      
## 2  ( 1 )  " "         " "      " "        " "   " "      " " " "      " "      
## 3  ( 1 )  " "         " "      " "        " "   " "      " " " "      " "      
## 4  ( 1 )  " "         " "      "*"        " "   " "      " " " "      " "      
## 5  ( 1 )  " "         " "      "*"        " "   " "      " " " "      " "      
## 6  ( 1 )  " "         " "      "*"        " "   " "      " " " "      "*"      
## 7  ( 1 )  " "         " "      "*"        " "   " "      " " " "      "*"      
## 8  ( 1 )  "*"         " "      "*"        " "   " "      " " " "      "*"      
## 9  ( 1 )  "*"         " "      "*"        "*"   " "      " " " "      "*"      
## 10  ( 1 ) "*"         " "      "*"        "*"   " "      " " " "      "*"      
## 11  ( 1 ) "*"         " "      "*"        "*"   " "      " " " "      "*"      
## 12  ( 1 ) "*"         " "      "*"        "*"   " "      "*" " "      "*"      
## 13  ( 1 ) "*"         " "      "*"        "*"   "*"      "*" " "      "*"      
## 14  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" " "      "*"      
## 15  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" "*"      "*"      
## 16  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" "*"      "*"      
## 17  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" "*"      "*"      
##           perc.alumni Expend Grad.Rate
## 1  ( 1 )  " "         " "    " "      
## 2  ( 1 )  " "         " "    " "      
## 3  ( 1 )  " "         " "    " "      
## 4  ( 1 )  " "         " "    " "      
## 5  ( 1 )  " "         "*"    " "      
## 6  ( 1 )  " "         "*"    " "      
## 7  ( 1 )  " "         "*"    "*"      
## 8  ( 1 )  " "         "*"    "*"      
## 9  ( 1 )  " "         "*"    "*"      
## 10  ( 1 ) "*"         "*"    "*"      
## 11  ( 1 ) "*"         "*"    "*"      
## 12  ( 1 ) "*"         "*"    "*"      
## 13  ( 1 ) "*"         "*"    "*"      
## 14  ( 1 ) "*"         "*"    "*"      
## 15  ( 1 ) "*"         "*"    "*"      
## 16  ( 1 ) "*"         "*"    "*"      
## 17  ( 1 ) "*"         "*"    "*"
which.max(summary(bwd_1)$adjr2)
## [1] 13
which.min(summary(bwd_1)$cp)
## [1] 11
which.min(summary(bwd_1)$bic)
## [1] 9
which.max(summary(bestsub_1)$adjr2)
## [1] 13
which.min(summary(bestsub_1)$cp)
## [1] 11
which.min(summary(bestsub_1)$bic)
## [1] 9
coef(bestsub_1, 9)
##        (Intercept) PrivateNot Private             Accept          Top10perc 
##      -25.027568709       22.676676201        0.016893239        0.492189120 
##        P.Undergrad         Room.Board              Books          S.F.Ratio 
##        0.001579532        0.003545740        0.012178874        1.130147449 
##             Expend          Grad.Rate 
##        0.001100015        0.219789893
coef(fwd_1, 9)
##        (Intercept) PrivateNot Private             Accept          Top10perc 
##      -25.027568709       22.676676201        0.016893239        0.492189120 
##        P.Undergrad         Room.Board              Books          S.F.Ratio 
##        0.001579532        0.003545740        0.012178874        1.130147449 
##             Expend          Grad.Rate 
##        0.001100015        0.219789893
coef(bwd_1, 9)
##        (Intercept) PrivateNot Private             Accept          Top10perc 
##      -25.027568709       22.676676201        0.016893239        0.492189120 
##        P.Undergrad         Room.Board              Books          S.F.Ratio 
##        0.001579532        0.003545740        0.012178874        1.130147449 
##             Expend          Grad.Rate 
##        0.001100015        0.219789893
#The results of best subset, forward, and backward selection are identical----

Model 9: Best Subset Selection Using the K-fold Cross-Validation Approach

k <- 10
set.seed(123)
folds <- sample(1:k, nrow(train), rep = TRUE)
cv_errors <- matrix(NA, k, 17, dimnames = list(NULL , paste(1:17)))
#Prediction function
predict_regsubsets <- function(object, newdata, id) {
  #regsubsets objects have no predict() method, so build one by hand
  reg_formula <- as.formula(object$call[[2]])  #recover the model formula
  mat    <- model.matrix(reg_formula, newdata) #design matrix for the new data
  coef_i <- coef(object, id = id)              #coefficients of the id-variable model
  mat[, names(coef_i)] %*% coef_i              #predictions
}

#K-fold Cross Validation
set.seed(1234)
for(i in 1:k){
  best_fit <- regsubsets(transform_Apps ~ . - Apps, data = train[folds != i, ], nvmax = 17, method = "exhaustive")
  for(j in 1:17){
    pred <- predict_regsubsets(best_fit, newdata = train[folds == i,], id = j)
    cv_errors[i, j] <- mean((train$transform_Apps[folds == i] - pred) ^ 2)
  }
}

head(cv_errors,5)
##             1        2        3        4        5        6        7        8
## [1,] 382.1926 327.9241 360.3289 311.8102 317.9707 290.6581 307.9539 309.1378
## [2,] 505.1825 369.3586 337.0397 306.5241 392.4758 321.7692 326.4828 322.4399
## [3,] 365.9648 340.7436 247.2648 221.8067 227.3207 208.1610 187.9396 181.0177
## [4,] 358.9736 275.9457 218.6549 206.6556 203.4075 189.5014 185.5489 187.5722
## [5,] 953.2446 739.2839 599.8729 556.2435 583.9070 567.5726 542.7559 542.6968
##             9       10       11       12       13       14       15       16
## [1,] 293.2583 306.2096 311.9493 322.8454 319.1472 309.7009 310.1036 310.0975
## [2,] 315.0029 313.4088 307.4334 306.1009 305.0323 306.2159 309.3954 308.3572
## [3,] 168.0497 166.1811 163.4508 162.3656 161.1758 163.4316 163.4525 163.2104
## [4,] 194.9740 204.4906 205.5274 203.9576 206.6665 203.1115 203.7468 203.1323
## [5,] 536.8341 537.0849 528.9368 524.8481 517.4240 519.5601 520.1017 520.4460
##            17
## [1,] 310.1188
## [2,] 309.0627
## [3,] 163.1106
## [4,] 203.0819
## [5,] 519.6718
mean_cv_errors <- apply(cv_errors, 2, mean)
mean_cv_errors 
##        1        2        3        4        5        6        7        8 
## 507.0688 399.3343 349.9750 324.6412 337.9616 318.7059 305.2901 307.7448 
##        9       10       11       12       13       14       15       16 
## 301.3743 306.5500 304.4494 303.7560 303.1106 302.6839 302.8821 302.9425 
##       17 
## 302.9120
plot(mean_cv_errors, type = "b")

which.min(mean_cv_errors)
## 9 
## 9
#Model with 9 variables
#Coefficients of the best model
coef(bestsub_1, 9) 
##        (Intercept) PrivateNot Private             Accept          Top10perc 
##      -25.027568709       22.676676201        0.016893239        0.492189120 
##        P.Undergrad         Room.Board              Books          S.F.Ratio 
##        0.001579532        0.003545740        0.012178874        1.130147449 
##             Expend          Grad.Rate 
##        0.001100015        0.219789893
bestsub_cv_1 <- lm(transform_Apps ~ Private + Accept + Top10perc + P.Undergrad + Room.Board +
                     Books + S.F.Ratio + Expend + Grad.Rate, data = train)
summary(bestsub_cv_1)
## 
## Call:
## lm(formula = transform_Apps ~ Private + Accept + Top10perc + 
##     P.Undergrad + Room.Board + Books + S.F.Ratio + Expend + Grad.Rate, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -107.515   -9.250    0.223    8.351   74.224 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.503e+01  6.207e+00  -4.032 6.33e-05 ***
## PrivateNot Private  2.268e+01  2.319e+00   9.777  < 2e-16 ***
## Accept              1.689e-02  4.161e-04  40.598  < 2e-16 ***
## Top10perc           4.922e-01  5.837e-02   8.432 3.20e-16 ***
## P.Undergrad         1.579e-03  5.378e-04   2.937  0.00346 ** 
## Room.Board          3.546e-03  8.263e-04   4.291 2.11e-05 ***
## Books               1.218e-02  4.687e-03   2.599  0.00962 ** 
## S.F.Ratio           1.130e+00  2.416e-01   4.678 3.67e-06 ***
## Expend              1.100e-03  2.070e-04   5.314 1.58e-07 ***
## Grad.Rate           2.198e-01  5.322e-02   4.130 4.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.81 on 533 degrees of freedom
## Multiple R-squared:  0.906,  Adjusted R-squared:  0.9044 
## F-statistic: 570.7 on 9 and 533 DF,  p-value: < 2.2e-16
#Normality of residuals
#Test for Skewness and Kurtosis
jarque.test(bestsub_cv_1$residuals)
## 
##  Jarque-Bera Normality Test
## 
## data:  bestsub_cv_1$residuals
## JB = 601.94, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(bestsub_cv_1$residuals)
## 
##  Anscombe-Glynn kurtosis test
## 
## data:  bestsub_cv_1$residuals
## kurt = 8.1514, z = 8.3056, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
#Reject the assumption of normality of the residuals
plot(bestsub_cv_1)

car::vif(bestsub_cv_1)
##     Private      Accept   Top10perc P.Undergrad  Room.Board       Books 
##    2.091806    1.773306    2.034090    1.427575    1.555793    1.045636 
##   S.F.Ratio      Expend   Grad.Rate 
##    1.804452    2.449353    1.642820
#All VIFs are below 2.5, so there is no multicollinearity problem.

Test the Model bestsub_cv_1

#Prediction
pred_bestsub_cv <- predict(bestsub_cv_1, test)
pred_bestsub_cv <- ((lambda_1*pred_bestsub_cv)+1)^(1/lambda_1)
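The second line undoes the Box-Cox transform applied to Apps, so the errors below are on the original application-count scale. A minimal round-trip sketch of the transform and its inverse (the lambda value here is illustrative only; the analysis uses the lambda_1 estimated earlier):

```r
#Forward Box-Cox transform and its inverse; applying one after the other
#recovers the original values (lambda = 0.3 is an illustrative value).
boxcox_fwd <- function(y, lambda) (y ^ lambda - 1) / lambda
boxcox_inv <- function(z, lambda) ((lambda * z) + 1) ^ (1 / lambda)

apps <- c(500, 1500, 8000)
all.equal(boxcox_inv(boxcox_fwd(apps, 0.3), 0.3), apps) #TRUE
```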

#Absolute error mean, median, sd, max, min-------
abs_err_bestsub_cv <- abs(pred_bestsub_cv - test$Apps)
mean(abs_err_bestsub_cv)
## [1] 711.1906
median(abs_err_bestsub_cv)
## [1] 370.7594
sd(abs_err_bestsub_cv)
## [1] 1116.981
range(abs_err_bestsub_cv)
## [1]    0.5630465 7773.2584894
#histogram and boxplot
hist(abs_err_bestsub_cv, breaks = 15)

boxplot(abs_err_bestsub_cv)

#Actual vs. Predicted
plot(test$Apps, pred_bestsub_cv, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)

Model 10: Best Subset Selection Using the Trimmed Training Set and CV

#Add transform_Apps
trimmed_train$transform_Apps <- ((trimmed_train$Apps ^ lambda_1) - 1) / lambda_1

trimmed_bestsub_1 <- regsubsets(transform_Apps ~ . - Apps, nvmax = 18, data = trimmed_train, method = "exhaustive")
summary(trimmed_bestsub_1)
## Subset selection object
## Call: regsubsets.formula(transform_Apps ~ . - Apps, nvmax = 18, data = trimmed_train, 
##     method = "exhaustive")
## 17 Variables  (and intercept)
##                    Forced in Forced out
## PrivateNot Private     FALSE      FALSE
## Accept                 FALSE      FALSE
## Enroll                 FALSE      FALSE
## Top10perc              FALSE      FALSE
## Top25perc              FALSE      FALSE
## F.Undergrad            FALSE      FALSE
## P.Undergrad            FALSE      FALSE
## Outstate               FALSE      FALSE
## Room.Board             FALSE      FALSE
## Books                  FALSE      FALSE
## Personal               FALSE      FALSE
## PhD                    FALSE      FALSE
## Terminal               FALSE      FALSE
## S.F.Ratio              FALSE      FALSE
## perc.alumni            FALSE      FALSE
## Expend                 FALSE      FALSE
## Grad.Rate              FALSE      FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: exhaustive
##           PrivateNot Private Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 )  " "                "*"    " "    " "       " "       " "        
## 2  ( 1 )  " "                "*"    " "    "*"       " "       " "        
## 3  ( 1 )  "*"                "*"    " "    "*"       " "       " "        
## 4  ( 1 )  "*"                "*"    "*"    "*"       " "       " "        
## 5  ( 1 )  "*"                "*"    "*"    "*"       " "       " "        
## 6  ( 1 )  "*"                "*"    "*"    "*"       " "       " "        
## 7  ( 1 )  "*"                "*"    "*"    "*"       " "       " "        
## 8  ( 1 )  "*"                "*"    "*"    "*"       " "       " "        
## 9  ( 1 )  "*"                "*"    "*"    "*"       " "       " "        
## 10  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 11  ( 1 ) "*"                "*"    "*"    "*"       " "       " "        
## 12  ( 1 ) "*"                "*"    "*"    "*"       "*"       " "        
## 13  ( 1 ) "*"                "*"    "*"    "*"       "*"       "*"        
## 14  ( 1 ) "*"                "*"    "*"    "*"       "*"       "*"        
## 15  ( 1 ) "*"                "*"    "*"    "*"       "*"       "*"        
## 16  ( 1 ) "*"                "*"    "*"    "*"       "*"       "*"        
## 17  ( 1 ) "*"                "*"    "*"    "*"       "*"       "*"        
##           P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 )  " "         " "      " "        " "   " "      " " " "      " "      
## 2  ( 1 )  " "         " "      " "        " "   " "      " " " "      " "      
## 3  ( 1 )  " "         " "      " "        " "   " "      " " " "      " "      
## 4  ( 1 )  " "         " "      " "        " "   " "      " " " "      " "      
## 5  ( 1 )  " "         " "      " "        "*"   " "      " " " "      " "      
## 6  ( 1 )  " "         " "      " "        "*"   " "      " " " "      "*"      
## 7  ( 1 )  " "         " "      " "        "*"   " "      "*" " "      "*"      
## 8  ( 1 )  " "         " "      " "        "*"   " "      "*" " "      "*"      
## 9  ( 1 )  "*"         " "      " "        "*"   " "      "*" " "      "*"      
## 10  ( 1 ) "*"         " "      " "        "*"   " "      "*" " "      "*"      
## 11  ( 1 ) "*"         " "      "*"        "*"   " "      "*" " "      "*"      
## 12  ( 1 ) "*"         " "      "*"        "*"   " "      "*" " "      "*"      
## 13  ( 1 ) "*"         " "      "*"        "*"   " "      "*" " "      "*"      
## 14  ( 1 ) "*"         " "      "*"        "*"   "*"      "*" " "      "*"      
## 15  ( 1 ) "*"         "*"      "*"        "*"   " "      "*" "*"      "*"      
## 16  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" "*"      "*"      
## 17  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" "*"      "*"      
##           perc.alumni Expend Grad.Rate
## 1  ( 1 )  " "         " "    " "      
## 2  ( 1 )  " "         " "    " "      
## 3  ( 1 )  " "         " "    " "      
## 4  ( 1 )  " "         " "    " "      
## 5  ( 1 )  " "         " "    " "      
## 6  ( 1 )  " "         " "    " "      
## 7  ( 1 )  " "         " "    " "      
## 8  ( 1 )  " "         " "    "*"      
## 9  ( 1 )  " "         " "    "*"      
## 10  ( 1 ) " "         "*"    "*"      
## 11  ( 1 ) " "         "*"    "*"      
## 12  ( 1 ) " "         "*"    "*"      
## 13  ( 1 ) " "         "*"    "*"      
## 14  ( 1 ) " "         "*"    "*"      
## 15  ( 1 ) " "         "*"    "*"      
## 16  ( 1 ) " "         "*"    "*"      
## 17  ( 1 ) "*"         "*"    "*"
which.max(summary(trimmed_bestsub_1)$adjr2)
## [1] 12
which.min(summary(trimmed_bestsub_1)$cp)
## [1] 12
which.min(summary(trimmed_bestsub_1)$bic)
## [1] 9
#Coefficients of the best model
coef(trimmed_bestsub_1, 12) #Model w/ 12 variables
##        (Intercept) PrivateNot Private             Accept             Enroll 
##      -8.5625334646      10.5656744263       0.0289251994      -0.0114867392 
##          Top10perc          Top25perc        P.Undergrad         Room.Board 
##       0.2037194524       0.0911240813       0.0011877556       0.0009993557 
##              Books                PhD          S.F.Ratio             Expend 
##       0.0138744140       0.1008030675       0.8026544203       0.0003172184 
##          Grad.Rate 
##       0.1068480341
trimmed_bestsub_2 <- lm(transform_Apps ~ Private + Accept + Enroll + Top10perc + Top25perc + P.Undergrad +
                          Room.Board + Books + PhD + S.F.Ratio + Expend + Grad.Rate, data = trimmed_train)
summary(trimmed_bestsub_2)
## 
## Call:
## lm(formula = transform_Apps ~ Private + Accept + Enroll + Top10perc + 
##     Top25perc + P.Undergrad + Room.Board + Books + PhD + S.F.Ratio + 
##     Expend + Grad.Rate, data = trimmed_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.444  -5.803  -0.051   5.433  46.447 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -8.5625335  4.5995690  -1.862  0.06327 .  
## PrivateNot Private 10.5656744  1.6902350   6.251 9.01e-10 ***
## Accept              0.0289252  0.0009644  29.994  < 2e-16 ***
## Enroll             -0.0114867  0.0022037  -5.213 2.77e-07 ***
## Top10perc           0.2037195  0.0775878   2.626  0.00892 ** 
## Top25perc           0.0911241  0.0585006   1.558  0.11997    
## P.Undergrad         0.0011878  0.0004865   2.442  0.01498 *  
## Room.Board          0.0009994  0.0005949   1.680  0.09363 .  
## Books               0.0138744  0.0030863   4.495 8.71e-06 ***
## PhD                 0.1008031  0.0396526   2.542  0.01133 *  
## S.F.Ratio           0.8026544  0.1637948   4.900 1.31e-06 ***
## Expend              0.0003172  0.0001618   1.960  0.05052 .  
## Grad.Rate           0.1068480  0.0360550   2.963  0.00319 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.71 on 480 degrees of freedom
## Multiple R-squared:  0.9207, Adjusted R-squared:  0.9187 
## F-statistic: 464.2 on 12 and 480 DF,  p-value: < 2.2e-16

Test the Model trimmed_bestsub_2

pred_trimmed_bestsub  <- predict(trimmed_bestsub_2, test)
pred_trimmed_bestsub  <- ((lambda_1*pred_trimmed_bestsub)+1)^(1/lambda_1)
head(pred_trimmed_bestsub,5)
##         8        13        15        18        20 
## 2200.6358 1130.7939  499.0944  818.8541 2592.7988
#Absolute error mean, median, sd, max, min-------
abs_err_trimmed_bestsub <- abs(pred_trimmed_bestsub - test$Apps)
mean(abs_err_trimmed_bestsub)
## [1] 1007.595
median(abs_err_trimmed_bestsub)
## [1] 251.5226
sd(abs_err_trimmed_bestsub)
## [1] 2460.769
range(abs_err_trimmed_bestsub)
## [1] 6.774924e-01 1.983109e+04
IQR(abs_err_trimmed_bestsub)
#histogram and boxplot
hist(abs_err_trimmed_bestsub, breaks = 15)

boxplot(abs_err_trimmed_bestsub)

#Actual vs. Predicted
plot(test$Apps, pred_trimmed_bestsub, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)

Model 11: Ridge Regression

#Regularization
x <- model.matrix(transform_Apps ~ . - Apps, data = train)[, -1] #remove intercept column
y <- train$transform_Apps

lambda_ridge_grid <- 10 ^ seq(10, -2, length = 100)
head(lambda_ridge_grid,5)
## [1] 10000000000  7564633276  5722367659  4328761281  3274549163
#Apply Ridge Regression
library("glmnet")
ridgereg_1 <- glmnet(x, y, alpha = 0, lambda = lambda_ridge_grid)
dim(coef(ridgereg_1))
## [1]  18 100
#Cross validation to choose the best model
set.seed(1234)
ridge_cv    <- cv.glmnet(x, y, alpha = 0, nfolds = 10)
#The mean cross-validated error
ridge_cv$cvm
##   [1] 2964.9934 2941.6670 2937.5400 2934.7770 2931.7501 2928.4345 2924.8034
##   [8] 2920.8276 2916.4754 2911.7123 2906.5009 2900.8007 2894.5679 2887.7551
##  [15] 2880.3112 2872.1815 2863.3067 2853.6235 2843.0642 2831.5564 2819.0234
##  [22] 2805.3837 2790.5512 2774.4357 2756.9425 2737.9733 2717.4265 2695.1978
##  [29] 2671.1812 2645.2698 2617.3575 2587.3399 2555.1169 2520.5940 2483.6851
##  [36] 2444.3150 2402.4222 2357.9619 2310.9092 2261.2625 2209.0465 2154.3160
##  [43] 2097.1559 2037.6874 1976.0671 1912.4886 1847.1817 1780.4118 1712.4769
##  [50] 1643.7041 1574.4445 1505.0669 1435.9509 1367.4789 1300.0274 1233.9595
##  [57] 1169.6160 1107.3082 1047.3118  989.8617  935.1481  883.3151  834.4577
##  [64]  788.5903  745.7719  705.9860  669.1542  635.1842  603.9622  575.3564
##  [71]  549.2220  525.4061  503.7519  484.1018  466.2854  450.2011  435.6785
##  [78]  422.5782  410.7715  400.1381  390.5665  381.9538  374.2049  367.2329
##  [85]  360.9577  355.3014  350.2082  345.6137  341.4641  337.7110  334.3092
##  [92]  331.2227  328.4146  325.8519  323.5140  321.3863  319.4330  317.6657
##  [99]  316.0548  314.6837
#Estimate of standard error of cvm.
ridge_cv$cvsd
##   [1] 229.38531 227.84597 227.05103 226.85872 226.64801 226.41716 226.16431
##   [8] 225.88740 225.58421 225.25233 224.88912 224.49174 224.05710 223.58188
##  [15] 223.06245 222.49493 221.87516 221.19862 220.46051 219.65566 218.77858
##  [22] 217.82344 216.78405 215.65388 214.42608 213.09347 211.64860 210.08379
##  [29] 208.39114 206.56265 204.59027 202.46599 200.18201 197.73081 195.10535
##  [36] 192.29926 189.30701 186.12416 182.74755 179.17561 175.40856 171.44873
##  [43] 167.30060 162.97123 158.47039 153.81063 149.00735 144.07882 139.04608
##  [50] 133.93273 128.76464 123.56955 118.37662 113.21581 108.11724 103.11051
##  [57]  98.22398  93.48405  88.91450  84.53587  80.36499  76.41460  72.69390
##  [64]  69.20068  65.94306  62.91484  60.10898  57.51512  55.12121  52.91375
##  [71]  50.87852  49.00116  47.26772  45.66516  44.17692  42.80176  41.52693
##  [78]  40.34381  39.24591  38.22795  37.28581  36.41629  35.61702  34.88629
##  [85]  34.22291  33.62619  33.09508  32.62949  32.22913  31.89372  31.62318
##  [92]  31.41678  31.27260  31.19257  31.17202  31.21274  31.31343  31.46393
##  [99]  31.67037  31.88659
#value of lambda that gives minimum cvm
ridge_cv$lambda.min
## [1] 4.953654
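cv.glmnet also reports lambda.1se, the largest lambda whose cross-validated error is within one standard error of the minimum; it gives a more heavily shrunk model and is a common alternative to lambda.min. A self-contained sketch on simulated data (the names x_sim, y_sim, cv_sim are illustrative):

```r
#Compare lambda.min and lambda.1se from cv.glmnet on simulated data.
library("glmnet")
set.seed(1)
x_sim <- matrix(rnorm(100 * 5), 100, 5)
y_sim <- x_sim %*% c(2, -1, 0, 0, 1) + rnorm(100)
cv_sim <- cv.glmnet(x_sim, y_sim, alpha = 0, nfolds = 10)
cv_sim$lambda.min #lambda minimizing mean CV error
cv_sim$lambda.1se #largest lambda within one SE of that minimum (more shrinkage)
```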
#Coefficients of regression w/ best_lambda
ridgereg <- glmnet(x, y, alpha = 0, lambda = ridge_cv$lambda.min)
coef(ridgereg)
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                               s0
## (Intercept)        -3.269253e+01
## PrivateNot Private  1.889714e+01
## Accept              1.012133e-02
## Enroll              1.157851e-02
## Top10perc           3.033624e-01
## Top25perc           8.463197e-02
## F.Undergrad         1.115164e-03
## P.Undergrad         6.489259e-04
## Outstate            5.146141e-04
## Room.Board          3.690088e-03
## Books               1.135417e-02
## Personal            9.137186e-04
## PhD                 1.209991e-01
## Terminal            2.252635e-03
## S.F.Ratio           9.337835e-01
## perc.alumni        -1.989425e-01
## Expend              1.011114e-03
## Grad.Rate           2.652834e-01

Test the Model Ridge Regression

#Create model matrix for test
test$transform_Apps <- ((test$Apps ^ lambda_1) - 1) / lambda_1
x_test <- model.matrix(transform_Apps ~ . - Apps, data = test)[, -1] #remove intercept column
pred_ridgereg <- predict(ridgereg, s = ridge_cv$lambda.min, newx = x_test)
head(pred_ridgereg,5)
##            1
## 8   87.39334
## 13  70.89806
## 15  45.29346
## 18  56.38546
## 20 100.17449
pred_ridgereg <- ((lambda_1*pred_ridgereg)+1)^(1/lambda_1)
head(pred_ridgereg,5)
##            1
## 8  1997.7924
## 13 1328.5318
## 15  559.1679
## 18  852.2155
## 20 2609.9066
#Absolute error mean, median, sd, max, min
abs_err_ridgereg <- abs(pred_ridgereg - test$Apps)
mean(abs_err_ridgereg)
## [1] 797.11
median(abs_err_ridgereg)
## [1] 396.5723
sd(abs_err_ridgereg)
## [1] 1315.391
IQR(abs_err_ridgereg)
## [1] 625.6818
range(abs_err_ridgereg)
## [1]    1.6371 8856.8520
#Actual vs. Predicted
plot(test$Apps, pred_ridgereg, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)

Model 12: Lasso Regression

library("glmnet")
lassoreg_1 <- glmnet(x, y, alpha = 1, lambda = lambda_ridge_grid)
dim(coef(lassoreg_1))
## [1]  18 100
#Plot Reg. Coefficients vs. Log Lambda
plot(lassoreg_1, xvar = "lambda")

#Cross validation to choose the best model
set.seed(1234)
lasso_cv    <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
#The mean cross-validated error
lasso_cv$cvm
##  [1] 2947.9467 2559.7017 2208.6449 1917.5853 1676.3016 1476.3108 1310.5730
##  [8] 1173.2461 1059.4824  965.2593  887.2345  822.5246  768.3007  723.2994
## [15]  684.7024  642.8826  601.0385  565.0436  534.6617  509.4137  488.2328
## [22]  469.2745  451.2316  434.9290  421.0045  408.7920  397.7708  387.5685
## [29]  378.0573  370.0615  362.9722  356.7581  350.9709  345.3854  340.1411
## [36]  334.8402  329.8254  325.3377  321.7092  318.7207  316.3989  314.4964
## [43]  312.8172  311.4153  310.1963  309.1967  308.3239  307.5644  306.9472
## [50]  306.4893  306.1187  305.7969  305.4867  305.2694  305.0969  304.9713
## [57]  304.8520  304.7640  304.7023  304.6582  304.6268  304.5971  304.5799
## [64]  304.5978  304.6209  304.6783  304.7182  304.7648
#Estimate of standard error of cvm.
lasso_cv$cvsd
##  [1] 234.42691 206.82279 174.79804 150.24636 131.43051 116.92530 105.60617
##  [8]  96.62714  89.38338  83.46327  78.59908  74.63178  71.45641  69.08312
## [15]  67.59738  65.67452  62.37739  59.74929  57.61364  55.95526  54.65781
## [22]  53.51365  52.59322  51.39216  50.16230  49.07686  48.19189  47.40230
## [29]  46.80685  46.41385  45.93875  45.51774  45.10412  44.62993  44.22872
## [36]  43.98220  43.88357  43.82384  43.83979  43.86577  43.89339  43.94365
## [43]  44.00276  44.09507  44.13005  44.20173  44.26622  44.34475  44.41912
## [50]  44.49129  44.55797  44.60591  44.59931  44.61239  44.62322  44.63199
## [57]  44.64308  44.65535  44.66612  44.67608  44.68542  44.69704  44.70659
## [64]  44.71878  44.73081  44.76461  44.77174  44.77838
#value of lambda that gives minimum cvm
lasso_cv$lambda.min
## [1] 0.1548372
#Coefficients of regression w/ best_lambda
lassoreg_2 <- glmnet(x, y, alpha = 1, lambda = lasso_cv$lambda.min)
coef(lassoreg_2)
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                               s0
## (Intercept)        -2.655394e+01
## PrivateNot Private  2.037740e+01
## Accept              1.541375e-02
## Enroll              4.277179e-03
## Top10perc           4.565052e-01
## Top25perc           .           
## F.Undergrad         .           
## P.Undergrad         1.040365e-03
## Outstate            7.060850e-05
## Room.Board          3.467202e-03
## Books               1.093456e-02
## Personal            1.163522e-03
## PhD                 8.006472e-02
## Terminal            .           
## S.F.Ratio           1.012786e+00
## perc.alumni        -1.138125e-01
## Expend              1.042622e-03
## Grad.Rate           2.348864e-01

Test the Model Lasso Regression

pred_lassoreg <- predict(lassoreg_2, s = lasso_cv$lambda.min, newx = x_test)
head(pred_lassoreg,5)
##            1
## 8   88.50788
## 13  71.26303
## 15  46.45110
## 18  55.12598
## 20 101.94912
pred_lassoreg <- ((lambda_1*pred_lassoreg)+1)^(1/lambda_1)
head(pred_lassoreg,5)
##            1
## 8  2047.9192
## 13 1341.8680
## 15  586.8773
## 18  815.8445
## 20 2701.3551
#Absolute error mean, median, sd, max, min-------
abs_err_lassoreg <- abs(pred_lassoreg - test$Apps)
mean(abs_err_lassoreg)
## [1] 720.7397
median(abs_err_lassoreg)
## [1] 371.5482
sd(abs_err_lassoreg)
## [1] 1173.443
IQR(abs_err_lassoreg)
## [1] 581.7214
range(abs_err_lassoreg)
## [1]    1.616778 7866.567852

Model 13: Decision Trees

#Decision Tree Model Using All Variables
library("rpart") 
tree_1 <- rpart(formula = transform_Apps ~ . - Apps, data = train, cp = 0.0001, maxdepth = 17)

#Plot the tree
library("rpart.plot")
prp(tree_1)

#Cross-validated error vs. cp (to choose the pruning level)
plotcp(tree_1)

tree_1$cptable[which.min(tree_1$cptable[, "xerror"]), "CP"]
## [1] 0.0001585943
#Prune the tree
tree_2 <- prune.rpart(tree_1, 
                      cp = tree_1$cptable[which.min(tree_1$cptable[, "xerror"]), "CP"])

#Plot the tree
prp(tree_2)
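Here cp is chosen at the minimum cross-validated error. A common alternative is the 1-SE rule: take the largest cp whose xerror lies within one standard error (xstd) of that minimum, which yields a smaller tree. A sketch on a built-in dataset (tree_1 above can be treated the same way):

```r
#1-SE rule for choosing the rpart pruning parameter cp.
library("rpart")
set.seed(1)
fit <- rpart(mpg ~ ., data = mtcars, cp = 0.001)
tab <- fit$cptable
i_min  <- which.min(tab[, "xerror"])
thresh <- tab[i_min, "xerror"] + tab[i_min, "xstd"]     #one-SE bound
cp_1se <- max(tab[tab[, "xerror"] <= thresh, "CP"])     #largest cp under the bound
pruned <- prune(fit, cp = cp_1se)
```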

Test the Model Decision Tree

pred_tree  <- predict(tree_2, test)
pred_tree  <- ((lambda_1*pred_tree)+1)^(1/lambda_1)
head(pred_tree,5)
##         8        13        15        18        20 
## 2490.0515  955.6153  417.6820 1162.6730 2951.7708
#Absolute error mean, median, sd, max, min-------
abs_err_tree <- abs(pred_tree - test$Apps)
mean(abs_err_tree)
## [1] 511.7333
median(abs_err_tree)
## [1] 166.96
sd(abs_err_tree)
## [1] 977.2577
IQR(abs_err_tree)
## [1] 427.0593
range(abs_err_tree)
## [1] 9.915969e-01 1.002958e+04
#Actual vs. Predicted
plot(test$Apps, pred_tree, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)

Model 14: Bagging

library("randomForest")
set.seed(1234)
bagging_1 <- randomForest(transform_Apps ~ . - Apps, mtry = ncol(train) - 2, ntree = 500, data = train) #mtry = 17: all predictors at every split, i.e. bagging
bagging_1
## 
## Call:
##  randomForest(formula = transform_Apps ~ . - Apps, data = train,      mtry = ncol(train) - 2, ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 17
## 
##           Mean of squared residuals: 143.4716
##                     % Var explained: 95.14
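Bagging is the special case of a random forest in which every predictor is eligible at every split (mtry = p), which is why mtry = ncol(train) - 2 = 17 above. A minimal sketch of the contrast on a built-in dataset:

```r
#Bagging vs. random forest: the only difference is mtry.
library("randomForest")
set.seed(1)
p   <- ncol(mtcars) - 1                               #number of predictors
bag <- randomForest(mpg ~ ., data = mtcars, mtry = p) #bagging: all p at each split
rf  <- randomForest(mpg ~ ., data = mtcars)           #forest: default mtry = p/3
c(bagging_mse = tail(bag$mse, 1), forest_mse = tail(rf$mse, 1))
```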

Test the Model Bagging

#Prediction: Bagging
pred_bagging  <- predict(bagging_1, test)
pred_bagging  <- ((lambda_1*pred_bagging)+1)^(1/lambda_1)
head(pred_bagging,5)
##         8        13        15        18        20 
## 2310.4751  963.6219  393.0022 1142.6620 2539.5328
#Absolute error mean, median, sd, max, min-------
abs_err_bagging <- abs(pred_bagging - test$Apps)
mean(abs_err_bagging)
## [1] 483.0405
median(abs_err_bagging)
## [1] 133.405
sd(abs_err_bagging)
## [1] 905.2469
IQR(abs_err_bagging)
## [1] 483.2991
range(abs_err_bagging)
## [1]    0.6534906 9092.7464852
#Actual vs. Predicted
plot(test$Apps, pred_bagging, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)

Model 15: Random Forest (Using K-fold Cross-Validation)

set.seed(1234)
rf_1 <- randomForest(transform_Apps ~ . - Apps, data = train, ntree = 500, importance = TRUE)
rf_1
## 
## Call:
##  randomForest(formula = transform_Apps ~ . - Apps, data = train,      ntree = 500, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 149.2345
##                     % Var explained: 94.94
importance(rf_1)
##               %IncMSE IncNodePurity
## Private      6.826303     54012.429
## Accept      35.357239    603563.626
## Enroll      19.642033    373794.371
## Top10perc   11.473543     27159.186
## Top25perc   12.545436     36526.523
## F.Undergrad 15.744020    275972.687
## P.Undergrad  7.086237     46690.799
## Outstate    14.459781     30033.667
## Room.Board  10.123740     12799.445
## Books        1.396802      5489.845
## Personal     3.730616      6508.770
## PhD          8.138054     40276.026
## Terminal     9.019460     25454.053
## S.F.Ratio    5.812609      8956.428
## perc.alumni  5.129829      6936.498
## Expend       8.403681     24034.705
## Grad.Rate    9.947238     20333.004
varImpPlot(rf_1)

#K-fold Cross-Validation for feature selection
dim(train)
## [1] 543  19
set.seed(12345)
rf_cv <- rfcv(train[, -c(2, 19)], #drop Apps (col 2) and transform_Apps (col 19)
              train$transform_Apps, 
              cv.fold = 10,
              step = 0.75,
              mtry = function(p) max(1, floor(sqrt(p))), 
              recursive = FALSE)
class(rf_cv)
## [1] "list"
str(rf_cv)
## List of 3
##  $ n.var    : num [1:9] 17 13 10 7 5 4 3 2 1
##  $ error.cv : Named num [1:9] 169 164 152 151 164 ...
##   ..- attr(*, "names")= chr [1:9] "17" "13" "10" "7" ...
##  $ predicted:List of 9
##   ..$ 17: num [1:543] 224 146 59 129 154 ...
##   ..$ 13: num [1:543] 223.8 144.2 58.7 129 153.4 ...
##   ..$ 10: num [1:543] 227.6 147.1 57.4 129.9 148.2 ...
##   ..$ 7 : num [1:543] 236.4 151.4 58.3 133.9 149.9 ...
##   ..$ 5 : num [1:543] 241.1 153.4 57.9 134.8 146.8 ...
##   ..$ 4 : num [1:543] 243.4 151.6 58.1 134.6 150 ...
##   ..$ 3 : num [1:543] 236.2 152.6 59.7 140.6 148.8 ...
##   ..$ 2 : num [1:543] 249.9 171.8 57.7 145.5 159.1 ...
##   ..$ 1 : num [1:543] 264.1 198.6 57.4 129.4 162.6 ...
#Vector of number of variables used at each step
rf_cv$n.var
## [1] 17 13 10  7  5  4  3  2  1
#Corresponding vector of MSEs at each step
rf_cv$error.cv
##       17       13       10        7        5        4        3        2 
## 169.0763 164.4601 152.2640 151.0196 164.1861 177.6105 286.1920 288.5219 
##        1 
## 303.9524
which.min(rf_cv$error.cv)
## 7 
## 4
#Keep the 7 most important variables (drop the 10 least important)
sort(importance(rf_1)[,1])
##       Books    Personal perc.alumni   S.F.Ratio     Private P.Undergrad 
##    1.396802    3.730616    5.129829    5.812609    6.826303    7.086237 
##         PhD      Expend    Terminal   Grad.Rate  Room.Board   Top10perc 
##    8.138054    8.403681    9.019460    9.947238   10.123740   11.473543 
##   Top25perc    Outstate F.Undergrad      Enroll      Accept 
##   12.545436   14.459781   15.744020   19.642033   35.357239
#Regression formula
reg_formula <- as.formula(transform_Apps ~ Room.Board + Top10perc + Top25perc + Outstate + F.Undergrad + Enroll + Accept) 
reg_formula
## transform_Apps ~ Room.Board + Top10perc + Top25perc + Outstate + 
##     F.Undergrad + Enroll + Accept
#mtry   
floor(sqrt(7))   
## [1] 2
set.seed(1234)
rf_2 <- randomForest(reg_formula, data = train, mtry = 2, ntree = 500)
rf_2
## 
## Call:
##  randomForest(formula = reg_formula, data = train, mtry = 2, ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 137.571
##                     % Var explained: 95.34

Test the Model Random Forest (Using K-fold Cross-Validation)

#Prediction: Random Forest
pred_rf  <- predict(rf_2, test)
pred_rf  <- ((lambda_1*pred_rf)+1)^(1/lambda_1)
head(pred_rf,5)
##         8        13        15        18        20 
## 2333.8852  978.4337  436.9443 1098.3207 2588.3653
#Absolute error mean, median, sd, max, min-------
abs_err_rf <- abs(pred_rf - test$Apps)
mean(abs_err_rf)
## [1] 439.2642
median(abs_err_rf)
## [1] 176.5174
sd(abs_err_rf)
## [1] 693.1222
IQR(abs_err_rf)
## [1] 453.6405
range(abs_err_rf)
## [1]    1.055278 6566.613865
#Actual vs. Predicted
plot(test$Apps, pred_rf, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)
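The back-transformation applied to `pred_rf` above inverts the Box-Cox transform used on `Apps` earlier in the report. A self-contained round-trip check, with a hypothetical lambda of 0.3 standing in for the fitted `lambda_1`:

```r
# Forward Box-Cox transform and its inverse (hypothetical lambda = 0.3;
# the report's lambda_1 is estimated in an earlier section)
lambda <- 0.3
bc_fwd <- function(y, l) (y^l - 1) / l
bc_inv <- function(z, l) ((l * z) + 1)^(1 / l)  # same form as used on pred_rf

apps <- c(500, 1500, 4000)
bc_inv(bc_fwd(apps, lambda), lambda)  # recovers 500, 1500, 4000
```

The same inverse is applied after every model's prediction step below, so predictions are always compared on the original `Apps` scale.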

Model 16: Bagging with Trimmed Train

set.seed(1234)
trimmedbagging_1 <- randomForest(transform_Apps ~ . - Apps, mtry = ncol(trimmed_train) - 2, ntree = 500, data = trimmed_train)
trimmedbagging_1
## 
## Call:
##  randomForest(formula = transform_Apps ~ . - Apps, data = trimmed_train,      mtry = ncol(trimmed_train) - 2, ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 17
## 
##           Mean of squared residuals: 78.25712
##                     % Var explained: 94.45

Test the Model Bagging with Trimmed Train

#Prediction Bagging
pred_trimmedbagging  <- predict(trimmedbagging_1, test)
pred_trimmedbagging  <- ((lambda_1*pred_trimmedbagging)+1)^(1/lambda_1)
head(pred_trimmedbagging,5)
##         8        13        15        18        20 
## 2317.5671  960.4946  372.0066 1145.9394 2582.6366
#Absolute error mean, median, sd, max, min-------
abs_err_trimmedbagging <- abs(pred_trimmedbagging - test$Apps)
mean(abs_err_trimmedbagging)
## [1] 735.7105
median(abs_err_trimmedbagging)
## [1] 147.8909
sd(abs_err_trimmedbagging)
## [1] 1634.94
IQR(abs_err_trimmedbagging)
## [1] 441.6711
range(abs_err_trimmedbagging)
## [1]    0.3129347 9394.1215224
#Actual vs. Predicted
plot(test$Apps, pred_trimmedbagging, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)

Model 17: GB Regression

library("gbm")
library("xgboost")
library("ggplot2")
set.seed(123)
#train GBM model
gbm_1 <- gbm(formula = transform_Apps ~ . - Apps,
             distribution = "gaussian",
             data = train,
             n.trees = 10000, #the total number of trees to fit
             interaction.depth = 1, #1: stump, the maximum depth of each tree 
             shrinkage = 0.001, #learning rate
             cv.folds = 5, #Number of cross-validation folds to perform
             n.cores = NULL, #will use all cores by default
             verbose = FALSE)  

#get MSE and compute RMSE
min(gbm_1$cv.error)         #MSE
## [1] 158.3526
sqrt(min(gbm_1$cv.error))   #RMSE
## [1] 12.58382
#plot loss function as a result of n trees added to the ensemble
gbm.perf(gbm_1, method = "cv")

## [1] 9999
#returns the estimated optimal number of iterations

#Use different parameters
set.seed(123)
#Tuning
#Create hyper-parameter grid
par_grid <- expand.grid(shrinkage = c(0.001, 0.005, 0.01),
                        interaction_depth = c(1, 3, 5), 
                        n_minobsinnode = c(5, 10, 15),  
                        bag_fraction = c(0.5, 0.7, 0.9)  
)
nrow(par_grid)
## [1] 81
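expand.grid builds one row per combination of the supplied parameter values, which is why the grid above has 3 x 3 x 3 x 3 = 81 rows; a minimal illustration:

```r
# Every combination of the supplied values: 2 shrinkages x 2 depths = 4 rows
g <- expand.grid(shrinkage = c(0.001, 0.01), depth = c(1, 3))
nrow(g)  # 4
```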
#Grid search (train/validation approach)
for(i in 1:nrow(par_grid)) {
  set.seed(123)
  #train model
  gbm_tune <- gbm(formula = transform_Apps ~ . - Apps,
                  distribution = "gaussian",
                  data = train,
                  n.trees = 5000,
                  interaction.depth = par_grid$interaction_depth[i],
                  shrinkage = par_grid$shrinkage[i],
                  n.minobsinnode = par_grid$n_minobsinnode[i],
                  bag.fraction = par_grid$bag_fraction[i],
                  train.fraction = 0.8,
                  cv.folds = 0,
                  n.cores = NULL, #will use all cores by default
                  verbose = FALSE)  
  #add min training error and trees to grid
  par_grid$optimal_trees[i] <- which.min(gbm_tune$valid.error)
  par_grid$min_RMSE[i]      <- sqrt(min(gbm_tune$valid.error))
}
knitr::kable(head(par_grid,5))
| shrinkage | interaction_depth | n_minobsinnode | bag_fraction | optimal_trees | min_RMSE |
|----------:|------------------:|---------------:|-------------:|--------------:|---------:|
| 0.001 | 1 | 5 | 0.5 | 5000 | 11.621141 |
| 0.005 | 1 | 5 | 0.5 | 4729 | 9.039213 |
| 0.010 | 1 | 5 | 0.5 | 4790 | 8.983764 |
| 0.001 | 3 | 5 | 0.5 | 5000 | 8.700245 |
| 0.005 | 3 | 5 | 0.5 | 2748 | 8.304058 |
#Modify hyper-parameter grid
par_grid <- expand.grid(shrinkage = c(0.005, 0.007, 0.01),
                        interaction_depth = c(3, 4, 5),
                        n_minobsinnode = c(4, 5, 6),
                        bag_fraction = c(0.5, 0.5, 0.7)  #stochastic gradient boosting: bag.fraction < 1
)

#Grid search 
for(i in 1:nrow(par_grid)) {
  set.seed(123)
  #train model
  gbm_tune <- gbm(formula = transform_Apps ~ . - Apps,
                  distribution = "gaussian",
                  data = train,
                  n.trees = 5000,
                  interaction.depth = par_grid$interaction_depth[i],
                  shrinkage = par_grid$shrinkage[i],
                  n.minobsinnode = par_grid$n_minobsinnode[i],
                  bag.fraction = par_grid$bag_fraction[i],
                  train.fraction = 0.8,
                  cv.folds = 0,
                  n.cores = NULL, #will use all cores by default
                  verbose = FALSE)  
  #add min training error and trees to grid
  par_grid$optimal_trees[i] <- which.min(gbm_tune$valid.error)
  par_grid$min_RMSE[i]    <- sqrt(min(gbm_tune$valid.error))
}

knitr::kable(head(par_grid,5))
| shrinkage | interaction_depth | n_minobsinnode | bag_fraction | optimal_trees | min_RMSE |
|----------:|------------------:|---------------:|-------------:|--------------:|---------:|
| 0.005 | 3 | 4 | 0.5 | 3262 | 8.284006 |
| 0.007 | 3 | 4 | 0.5 | 2621 | 8.366324 |
| 0.010 | 3 | 4 | 0.5 | 2071 | 8.304693 |
| 0.005 | 4 | 4 | 0.5 | 4358 | 8.315379 |
| 0.007 | 4 | 4 | 0.5 | 1573 | 8.379029 |
#Final Model
gbm_3 <- gbm(formula = transform_Apps ~ . - Apps,
             distribution = "gaussian",
             data = train,
             n.trees = 2500,
             interaction.depth = 5,
             shrinkage = 0.010,
             n.minobsinnode = 5,
             bag.fraction = 0.5,
             train.fraction = 0.8,
             cv.folds = 0,
             n.cores = NULL #will use all cores by default
)

summary(gbm_3)

##                     var     rel.inf
## Accept           Accept 85.27310170
## Top10perc     Top10perc  2.43175401
## F.Undergrad F.Undergrad  2.34824064
## Top25perc     Top25perc  2.07351122
## Enroll           Enroll  1.68191087
## Outstate       Outstate  1.02719088
## Grad.Rate     Grad.Rate  0.89932108
## Room.Board   Room.Board  0.71521635
## Expend           Expend  0.67491768
## PhD                 PhD  0.59102565
## Books             Books  0.46780972
## P.Undergrad P.Undergrad  0.44227437
## perc.alumni perc.alumni  0.43011173
## S.F.Ratio     S.F.Ratio  0.36160514
## Personal       Personal  0.30870081
## Terminal       Terminal  0.23792353
## Private         Private  0.03538463

Test the Model GB Regression

#Model: gbm_3
#Prediction
pred_gbm <- predict(gbm_3, n.trees = 2500, newdata = test)
head(pred_gbm,5)
## [1] 93.02486 58.26126 37.75944 62.04406 98.87365
pred_gbm <- ((lambda_1*pred_gbm)+1)^(1/lambda_1)
head(pred_gbm,5)
## [1] 2257.4308  907.8548  395.2033 1025.4105 2543.8731
#Absolute error mean, median, sd, max, min-------
abs_err_gbm <- abs(pred_gbm - test$Apps)
mean(abs_err_gbm)
## [1] 459.865
median(abs_err_gbm)
## [1] 150.3577
sd(abs_err_gbm)
## [1] 769.8837
IQR(abs_err_gbm)
## [1] 412.7232
range(abs_err_gbm)
## [1] 8.244238e-02 5.293883e+03
#Actual vs. Predicted
plot(test$Apps, pred_gbm, xlab = "Actual", ylab = "Prediction")
abline(a = 0, b = 1, col = "red", lwd = 2)

Model 18: XGBoost Regression

x <- model.matrix(transform_Apps ~ . - Apps, data = train)[, -1] #remove intercept
y <- train$transform_Apps
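model.matrix expands factor columns into dummy variables and prepends an intercept column; dropping column 1 removes that intercept, since xgboost expects a plain numeric matrix. A minimal illustration on a toy data frame (variable names echo the report's, but the values are made up):

```r
# Toy data frame: one factor and one numeric predictor
d <- data.frame(y       = c(1, 2, 3),
                Private = factor(c("Yes", "No", "Yes")),
                Accept  = c(700, 300, 900))
x_toy <- model.matrix(y ~ ., data = d)[, -1]  #remove intercept column
colnames(x_toy)  # "PrivateYes" "Accept"
```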

set.seed(123)
xgb_1 <- xgboost(data = x, 
                 label = y,
                 eta = 0.1,                       #learning rate
                 lambda = 0,                      #regularization term
                 max_depth = 8,                   #tree depth 
                 nrounds = 1000,                  #max number of boosting iterations
                 subsample = 0.65,                #percent of training data to sample for each tree
                 objective = "reg:squarederror",  #for regression models
                 verbose = 0                      #silent
) 

#train RMSE
xgb_1$evaluation_log
##       iter train_rmse
##    1:    1  96.346611
##    2:    2  86.847481
##    3:    3  78.186195
##    4:    4  70.431076
##    5:    5  63.530521
##   ---                
##  996:  996   0.000425
##  997:  997   0.000425
##  998:  998   0.000425
##  999:  999   0.000425
## 1000: 1000   0.000425
#plot error vs number trees
ggplot(xgb_1$evaluation_log) +
  geom_line(aes(iter, train_rmse), color = "red") 

#Tuning(Train/validation using xgboost)
#Train and validation sets
set.seed(1234)
train_cases <- sample(1:nrow(train), nrow(train) * 0.8)
#Train data set
train_xgboost <- train[train_cases,]
dim(train_xgboost)
## [1] 434  19
#Model Matrix
xtrain <- model.matrix(transform_Apps ~ . - Apps, data = train_xgboost)[, -1] #remove intercept
ytrain <- train_xgboost$transform_Apps

#Validation data set
validation_xgboost  <- train[- train_cases,]
dim(validation_xgboost)
## [1] 109  19
xvalidation <- model.matrix(transform_Apps ~ . - Apps, data = validation_xgboost)[, -1] #remove intercept
yvalidation <- validation_xgboost$transform_Apps

#Create hyper-parameter grid
par_grid <- expand.grid(eta = c(0.01, 0.05, 0.1, 0.3),
                        lambda = c(0, 1, 2, 5),
                        max_depth = c(1, 3, 5, 7),
                        subsample = c(0.65, 0.8, 1), 
                        colsample_bytree = c(0.8, 0.9, 1))

dim(par_grid)
## [1] 576   5
#Grid search 
for(i in 1:nrow(par_grid )) {
  set.seed(123)
  
  #train model
  xgb_tune <- xgboost(data =  xtrain,
                      label = ytrain,
                      eta = par_grid$eta[i],
                      lambda = par_grid$lambda[i],
                      max_depth = par_grid$max_depth[i],
                      subsample = par_grid$subsample[i],
                      colsample_bytree = par_grid$colsample_bytree[i],
                      nrounds = 1000,
                      objective = "reg:squarederror",  #for regression models
                      verbose = 0,                     #silent,
                      early_stopping_rounds = 10       #stop if no improvement for 10 consecutive trees
  )
  
  #prediction on validation data set
  pred_xgb_validation <- predict(xgb_tune, xvalidation)
  rmse <- sqrt(mean((yvalidation - pred_xgb_validation) ^ 2))
  
  #add validation error
  par_grid$RMSE[i]  <- rmse
}

knitr::kable(head(par_grid,5))
| eta | lambda | max_depth | subsample | colsample_bytree | RMSE |
|----:|-------:|----------:|----------:|-----------------:|-----:|
| 0.01 | 0 | 1 | 0.65 | 0.8 | 12.27853 |
| 0.05 | 0 | 1 | 0.65 | 0.8 | 11.85737 |
| 0.10 | 0 | 1 | 0.65 | 0.8 | 12.39560 |
| 0.30 | 0 | 1 | 0.65 | 0.8 | 14.49645 |
| 0.01 | 1 | 1 | 0.65 | 0.8 | 12.27853 |
#Final Model
set.seed(123)
xgb_2 <- xgboost(data = x, 
                 label = y,
                 eta = 0.05,     #learning rate
                 max_depth = 3,  #tree depth 
                 lambda = 0,
                 nrounds = 1000,
                 colsample_bytree = 0.8,
                 subsample = 0.65,                 #percent of training data to sample for each tree
                 objective = "reg:squarederror",  #for regression models
                 verbose = 0                      #silent
)

Test the Model XGBoost Regression

#Model: xgb_2
x_test <- model.matrix(transform_Apps ~ . - Apps, data = test)[, -1]#remove intercept
pred_xgb <- predict(xgb_2, x_test)
head(pred_xgb,5)
## [1] 91.04086 59.17511 38.16672 65.59778 98.59235
pred_xgb <- ((lambda_1*pred_xgb)+1)^(1/lambda_1)
head(pred_xgb,5)
## [1] 2164.1505  935.5984  403.3414 1142.3649 2529.7054
#Absolute error mean, median, sd, max, min-------

abs_err_xgb <- abs(pred_xgb - test$Apps)
mean(abs_err_xgb)
## [1] 447.7819
median(abs_err_xgb)
## [1] 183.3197
sd(abs_err_xgb)
## [1] 699.316
IQR(abs_err_xgb)
## [1] 437.3617
range(abs_err_xgb)
## [1]    4.182853 5387.825432
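The same six absolute-error summaries are computed after every model's test step. A small hypothetical helper (not part of the original analysis, base R only) makes that repeated pattern explicit:

```r
# Collect the absolute-error summaries reported for each model
abs_err_summary <- function(pred, actual) {
  e <- abs(pred - actual)
  c(mean = mean(e), median = median(e), sd = sd(e),
    IQR = IQR(e), min = min(e), max = max(e))
}

# Toy illustration (not the report's predictions)
abs_err_summary(pred = c(10, 20, 30), actual = c(12, 18, 35))
```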

Assessment:

  1. Model Comparison: comparison of all models that reached the test stage.
df <- data.frame("Model_1" = abs_err_Tlm, 
                 "Model_2" = abs_err_Tlm_trimmed, 
                 "Model_3" = abs_err_lm_transform,
                 "Model_4" = abs_err_bestsub_adj2, 
                 "Model_5" =abs_err_bestsub_CP, 
                 "Model_6" =abs_err_bestsub_BIC,
                 "Model_7" =abs_err_trimmed_bestsub,
                 "Model_8" =abs_err_ridgereg,
                 "Model_9" =abs_err_lassoreg,
                 "Model_10" =abs_err_tree,
                 "Model_11" =abs_err_bagging,
                 "Model_11" =abs_err_rf,
                 "Model_13" =abs_err_trimmedbagging,
                 "Model_14" =abs_err_gbm,
                 "Model_15" =abs_err_xgb )
models_comp <- data.frame("Mean of AbsErrors"   = apply(df, 2, mean),
                          "Median of AbsErrors" = apply(df, 2, median),
                          "SD of AbsErrors"  = apply(df, 2, sd),
                          "IQR of AbsErrors" = apply(df, 2, IQR),
                          "Min of AbsErrors" = apply(df, 2, min),
                          "Max of AbsErrors" = apply(df, 2, max))
        

rownames(models_comp) <- c("Tlm", 
                           "Tlm_trimmed", 
                           "lm_transform",
                           "bestsub_adj2", 
                           "bestsub_CP", 
                           "bestsub_BIC",
                           "trimmed_bestsub",
                           "ridgereg",
                           "Lassoreg",
                           "Tree",
                           "bagging",
                           "RandomFarest",
                          "Trimmedbagging",
                          "Gbm",
                          "Xgb ")



models_comp <- rbind(models_comp, "lm_transform_trimmed_test" = c(mean(abs_err_lm_transform_trimmed_test),
                                                      median(abs_err_lm_transform_trimmed_test),
                                                      sd(abs_err_lm_transform_trimmed_test),
                                                      IQR(abs_err_lm_transform_trimmed_test),
                                                      range(abs_err_lm_transform_trimmed_test)))
#Reorder so lm_transform_trimmed_test follows lm_transform
models_comp <- models_comp[c(1, 2, 3, 16, 4:15), ]
library(knitr)
kable(models_comp)
| Model | Mean.of.AbsErrors | Median.of.AbsErrors | SD.of.AbsErrors | IQR.of.AbsErrors | Min.of.AbsErrors | Max.of.AbsErrors |
|:------|------------------:|--------------------:|----------------:|-----------------:|-----------------:|-----------------:|
| Tlm | 599.7741 | 375.7187 | 734.9941 | 567.6974 | 2.0195353 | 6893.422 |
| Tlm_trimmed | 511.9436 | 253.6845 | 910.2706 | 420.2313 | 4.9054292 | 9327.358 |
| lm_transform | 711.1906 | 370.7594 | 1116.9812 | 578.1689 | 0.5630465 | 7773.258 |
| lm_transform_trimmed_test | 498.4068 | 326.4810 | 567.9842 | 528.3034 | 0.5630465 | 4022.992 |
| bestsub_adj2 | 722.8981 | 378.1958 | 1173.3939 | 593.8777 | 2.9694363 | 7858.990 |
| bestsub_CP | 724.3431 | 363.9783 | 1174.1980 | 581.8620 | 0.9072678 | 7946.280 |
| bestsub_BIC | 711.1906 | 370.7594 | 1116.9812 | 578.1689 | 0.5630465 | 7773.258 |
| trimmed_bestsub | 1007.5954 | 251.5226 | 2460.7692 | 500.0060 | 0.6774924 | 19831.086 |
| ridgereg | 797.1100 | 396.5723 | 1315.3914 | 625.6818 | 1.6371001 | 8856.852 |
| Lassoreg | 720.7397 | 371.5482 | 1173.4429 | 581.7214 | 1.6167782 | 7866.568 |
| Tree | 511.7333 | 166.9600 | 977.2577 | 427.0593 | 0.9915969 | 10029.575 |
| bagging | 483.0405 | 133.4050 | 905.2469 | 483.2991 | 0.6534906 | 9092.746 |
| RandomForest | 439.2642 | 176.5174 | 693.1222 | 453.6405 | 1.0552783 | 6566.614 |
| Trimmedbagging | 735.7105 | 147.8909 | 1634.9395 | 441.6711 | 0.3129347 | 9394.122 |
| Gbm | 459.8650 | 150.3577 | 769.8837 | 412.7232 | 0.0824424 | 5293.883 |
| Xgb | 447.7819 | 183.3197 | 699.3160 | 437.3617 | 4.1828526 | 5387.825 |

Depending on the comparison index (mean, median, SD, or IQR of the absolute errors), the best model differs. For example, the random forest model is best in terms of the mean absolute error, while bagging performs better in terms of the median absolute error.
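One way to make that per-metric comparison explicit is to ask which row of the comparison table minimizes each column; a sketch on toy values standing in for `models_comp`:

```r
# Which model minimizes each error summary?
# (toy values standing in for the report's models_comp table)
toy <- data.frame(Mean   = c(599.8, 439.3, 483.0),
                  Median = c(375.7, 176.5, 133.4),
                  row.names = c("Tlm", "RandomForest", "bagging"))
sapply(toy, function(col) rownames(toy)[which.min(col)])
# Mean -> "RandomForest", Median -> "bagging"
```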


  1. What suggestions do you have for testing the results in a real environment?

Establishment:

  1. What are the challenges of developing the algorithm?
  2. What solutions do you have to solve them?
  3. What requirements do you need to provide those solutions?

Conclusion:

  1. What did working on this project teach you?
  2. What challenges did you face, and how did you solve them?

Almost all linear regression models built on the training set violated the regression assumptions, including normality of the residuals. We therefore removed one observation that was an outlier in several other variables as well, which solved the leverage problem. We then transformed the dependent variable using the Box-Cox method; however, this did not resolve the violation of the residual-normality assumption, which makes predictions from such models impossible to generalize.

For this reason, we turned to other methods such as stepwise regression, random forests, and so on, whose purpose is purely data-driven prediction and which do not require the residual-normality assumption to hold.

END OF THE CODE.